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COMPUTATIONAL METHODS AND SYSTEMS FOR 
MULTIDIMENSIONAL ANALYSIS 

This application claims priority from provisional
application serial numbers 60/466,010, 60/466,011 and
60/466,012, all filed on April 28, 2003, all of which are
incorporated herein in their entirety. This application
also claims priority from United States application
serial no. 10/689,313 filed on October 20, 2003, the
entire contents of which are also incorporated by
reference herein.

BACKGROUND OF THE INVENTION 
1. Field of the Invention 

The present invention relates to chemical analysis
systems. More particularly, it relates to systems that
are useful for the analysis of complex mixtures of
molecules, including large organic molecules such as
proteins, environmental pollutants, and petrochemical
compounds, to methods of analysis used therein, and to a
computer program product having computer code embodied
therein for causing a computer, or a computer and a mass
spectrometer in combination, to effect such analysis.
Still more particularly, it relates to such systems that
have mass spectrometer portions.

2. Prior Art

The race to map the human genome in the past several
years has created a new scientific field and industry
named genomics, which studies DNA sequences to search for
genes and gene mutations that are responsible for genetic
diseases through their expressions in messenger RNAs
(mRNA) and the subsequent coding of peptides which give
rise to proteins. It has been well established in the
field that, while the genes are at the root of many
diseases including many forms of cancers, the proteins to
which these genes translate are the ones that carry out
the real biological functions. The identification and
quantification of these proteins and their interactions
thus serve as the key to the understanding of disease
states and the development of new therapeutics. It is
therefore not surprising to see the rapid shift in both
the commercial investment and academic research from
genes (genomics) to proteins (proteomics), after the
successful completion of the human genome project and the
identification of some 35,000 human genes in the summer
of 2000. Different from genomics, which has a more
definable end for each species, proteomics is much more
open-ended as any change in gene expression level,
environmental factors, and protein-protein interactions
can contribute to protein variations. In addition, the
genetic makeup of an individual is relatively stable
whereas the protein expressions can be much more dynamic
depending on various disease states and many other
factors. In this "post genomics era," the challenges are
to analyze the complex proteins (i.e., the proteome)
expressed by an organism in tissues, cells, or other
biological samples to aid in the understanding of the
complex cellular pathways, networks, and "modules" under
various physiological conditions. The identification and
quantitation of the proteins expressed in both normal and
diseased states plays a critical role in the discovery of
biomarkers or target proteins.

The challenges presented by the fast-developing field of
proteomics have brought an impressive array of highly
sophisticated scientific instrumentation to bear, from
sample preparation, sample separation, imaging, isotope
labeling, to mass spectral detection. Large data arrays
of higher and higher dimensions are being routinely
generated in both industry and academia around the world
in the race to reap the fruits of genomics and
proteomics. Due to the complexities and the sheer number
of proteins (easily reaching into thousands) typically
involved in proteomics studies, complicated, lengthy, and
painstaking physical separations are performed in order
to identify and sometimes quantify individual proteins in
a complex sample. These physical separations create
tremendous challenges for sample handling and information
tracking, not to mention the days, weeks, and even months
it typically takes to fully elucidate the content of a
single sample.

While there are only about 35,000 genes in the human
genome, there are an estimated 500,000 to 2,000,000
proteins in the human proteome that could be studied both
for the general population and for individuals under
treatment or other clinical conditions. A typical sample
taken from cells, blood, or urine, for example, usually
contains up to several thousand different proteins in
vastly different abundances. Over the past decade, the
industry has popularized a process that includes multiple
stages in order to analyze the many proteins existing in a
sample. This process is summarized in Table 1 with the
following notable features:

Table 1. A Typical Proteomics Process: Time, Cost, and Informatics Needs

Sample collection
  - Isolate proteins from biological samples such as blood, tissue,
    urine, etc.
  - Instrument cost: minimal; Time: 1-3 hours
  - Mostly liquid phase sample
  - Need to track sample source/preparation conditions

Gel separation
  - Separate proteins spatially through gel electrophoresis to generate
    up to several thousand protein spots
  - Instrument cost: $150K; Time: 24 hours
  - Liquid into solid phase
  - Need to track protein separation conditions and gel calibration
    information

Imaging and spot cutting
  - Image analysis to identify protein spots on the gel with MW/pI
    calibration, and spot cutting
  - Instrument cost: $150K; Time: 30 sec/spot
  - Solid phase
  - Track protein spot images, image processing parameters, gel
    calibration parameters, molecular weights (MW) and pI's, and
    cutting records

Protein digestion
  - Chemically break down proteins into peptides
  - Instrument cost: $50K; Time: 3 hours
  - Solid to liquid phase
  - Track digestion chemistry & reaction conditions

Protein spotting or sample preparation
  - Mix each digested sample with mass spectral matrix, spot on
    sample targets, and dry (MALDI) or sample preparation for
    LC/MS(/MS)
  - Instrument cost: $50K; Time: 30 sec/spot
  - Liquid to solid phase
  - Track volumes & concentrations for samples/reagents

Mass spectral analysis
  - Measure peptide(s) in each gel spot directly (MALDI) or via
    LC/MS(/MS)
  - Instrument: $200K-650K; Time: 1-10 sec/spot on MALDI or
    30 min/spot on LC/MS(/MS)
  - Solid phase on MALDI or liquid phase on LC/MS(/MS)
  - Track mass spectrometer operation, analysis, and peak processing

Protein database search
  - Search private/public protein databases to identify proteins based
    on unique peptides
  - Instrument cost: minimal; Time: 1-60 sec/spot

Summary
  - Instrument cost: $600K-$1M
  - [entry illegible]

a. It could take up to several days or weeks or even
months to complete the analysis of a single sample.

b. The bulky hardware system costs $600,000 to $1M
with significant operating (labor and consumables),
maintenance, and lab space cost associated with it.

c. This is an extremely tedious and complex process
that includes several different robots and a few
different types of instruments to essentially separate
one liquid sample into hundreds to thousands of
individual solid spots, each of which needs to be
analyzed one-at-a-time through another cycle of
solid-liquid-solid chemical processing.

d. It is not a small challenge to integrate these
pieces/steps together for a rapidly changing industry,
and as a result, there is not yet a commercial system
that fully integrates and automates all these steps.
Consequently, this process is fraught with human as well
as machine errors.

e. This process also calls for sample and data
tracking from all the steps along the way - not a
small challenge even for today's informatics.

f. Even for a fully automated process with a complete
sample and data tracking informatics system, it is not
clear how these data ought to be managed, navigated, and
most importantly, analyzed.

g. At this early stage of proteomics, many
researchers are content with qualitative identification
of proteins. The holy grail of proteomics is, however,
both identification and quantification, which would open
doors to exciting applications not only in the area of
biomarker identification for the purpose of drug
discovery but also for clinical diagnostics, as evidenced
by the intense interest generated from a recent
publication (Petricoin, E. F. III et al., Lancet,
Vol. 359, pp. 573-77 (2002)) on using protein profiles
from blood samples for ovarian cancer diagnostics. The
current process cannot be easily adapted for quantitative
analysis due to the protein loss, sample contamination,
or lack of gel solubility, although attempts have been
made for quantitative proteomics with the use of complex
chemical processes such as ICAT (isotope-coded affinity
tags), a general approach to quantitation wherein
proteins or protein digests from two different sample
sources are labeled by a pair of isotope atoms, and
subsequently mixed in one mass spectrometry analysis
(Gygi, S. P. et al., Nat. Biotechnol. 17, 994-999 (1999)).

Isotope-coded affinity tags (ICAT) is a commercialized
version of the approach introduced recently by
Applied Biosystems of Foster City, California. In this
technique, proteins from two different cell pools are
labeled with regular reagent (light) and deuterium
substituted reagent (heavy), and combined into one
mixture. After trypsin digestion, the combined digest
mixtures are subjected to separation by biotin-affinity
chromatography to result in a cysteine-containing
peptide mixture. This mixture is further
separated by reverse phase HPLC and analyzed by data
dependent mass spectrometry followed by database search.
This method significantly simplifies a complex peptide
mixture into a cysteine-containing peptide mixture and
allows simultaneous protein identification by SEQUEST
database search and quantitation by the ratio of light
peptides to heavy peptides. Similar to LC/LC/MS/MS, ICAT
also circumvents the insolubility problem, since both
techniques digest the whole protein mixture into peptide
fragments before separation and analysis.

While very powerful, the ICAT technique requires a
multi-step process for labeling and pre-separation,
resulting in the loss of low abundance proteins with added
reagent cost and further reducing the throughput for the
already slow proteomic analysis. Since only
cysteine-containing peptides are analyzed, the sequence
coverage is typically quite low with ICAT. As is the case
in a typical LC/MS/MS experiment, the protein
identification is achieved through the limited number of
MS/MS analyses on hopefully signature peptides, resulting
in only one and at most a few labeled peptides for ratio
quantitation.



Liquid chromatography interfaced with tandem mass
spectrometry (LC/MS/MS) has become a method of choice for
protein sequencing (Yates Jr. et al., Anal. Chem. 67,
1426-1436 (1995)). This method involves a few processes
including digestion of proteins, LC separation of peptide
mixtures generated from the protein digests, MS/MS
analysis of the resulting peptides, and database search for
protein identification. The key to effectively identifying
proteins with LC/MS/MS is to produce as many high quality
MS/MS spectra as possible to allow for reliable matching
during database search. This is achieved by a
data-dependent scanning technique in a quadrupole or an ion
trap instrument. With this technique, the mass
spectrometer checks the intensities and signal to noise
ratios of the most abundant ion(s) in a full scan MS
spectrum and performs MS/MS experiments when the
intensities and signal to noise ratios of the most
abundant ions exceed a preset threshold. Usually the
three most abundant ions are selected for the product ion
scans to maximize the sequence information and minimize
the time required, as the selection of more than three
ions for MS/MS experiments would possibly result in
missing other qualified peptides currently eluting from
the LC to the mass spectrometer.
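
The data-dependent selection just described can be sketched in Python as follows. This is a minimal illustration only and not the acquisition logic of any particular instrument: the function name `select_precursors`, the representation of a full scan as (m/z, intensity) pairs, the single `noise_level` value, and all numbers in the example are assumptions made for the sketch.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity); a hypothetical representation


def select_precursors(
    full_scan: List[Peak],
    noise_level: float,
    min_intensity: float,
    min_snr: float,
    top_n: int = 3,
) -> List[Peak]:
    """Pick up to top_n ions from a full MS scan for MS/MS.

    An ion qualifies only if both its intensity and its
    signal-to-noise ratio exceed the preset thresholds.
    """
    if noise_level <= 0:
        raise ValueError("noise_level must be positive")
    qualified = [
        (mz, inten)
        for mz, inten in full_scan
        if inten > min_intensity and inten / noise_level > min_snr
    ]
    # Most abundant ions first; only the first top_n are fragmented.
    qualified.sort(key=lambda p: p[1], reverse=True)
    return qualified[:top_n]


# Made-up full scan with four qualifying ions and one weak ion.
scan = [(445.1, 900.0), (512.3, 4200.0), (623.8, 150.0),
        (701.4, 2600.0), (833.9, 1800.0)]
chosen = select_precursors(scan, noise_level=50.0,
                           min_intensity=500.0, min_snr=10.0)
print(chosen)
```

In this made-up scan the ion at m/z 445.1 passes both thresholds but is not fragmented because three more abundant ions are chosen first, which is the coverage limitation the text attributes to data-dependent scanning.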

The success of LC/MS/MS for identification of proteins is
largely due to its many outstanding analytical
characteristics. Firstly, it is a quite robust technique
with excellent reproducibility. It has been demonstrated
that it is reliable for high throughput LC/MS/MS analysis
for protein identification. Secondly, when using
nanospray ionization, the technique delivers quality
MS/MS spectra of peptides at sub-femtomole levels.
Thirdly, the MS/MS spectra carry sequence information of
both C-terminal and N-terminal ions. This valuable
information can be used not only for identification of
proteins, but also for pinpointing what post
translational modifications (PTM) have occurred to the
protein and at which amino acid residue the PTM take
place.

For the total protein digest from an organism, a cell
line or a tissue type, LC/MS/MS alone cannot produce
enough good quality MS/MS spectra
for the identification of the proteins. Therefore,
LC/MS/MS is usually employed to analyze digests of a
single protein or a simple mixture of proteins, such as
the proteins separated by two dimensional electrophoresis
(2DE), adding a minimum of a few days to the total
analysis time, to the instrument and equipment cost, and
to the complexity of sample handling and the informatics
need for sample tracking. While a full MS scan can and
typically does contain rich information about the sample,
the current LC/MS/MS methodology relies on the MS/MS
analysis that can be afforded for only a few ions in the
full MS scan. Moreover, electrospray ionization (ESI)
used in LC/MS/MS has less tolerance towards salt
concentrations from the sample, requiring rigorous sample
clean up steps.

Identification of the proteins in an organism, a cell
line, and a tissue type is an extremely challenging task,
due to the sheer number of proteins in these systems
(estimated at thousands or tens of thousands). The
development of LC/LC/MS/MS technology (Link, A. J. et al.,
Nat. Biotechnol. 17, 676-682 (1999); Washburn, M. P. et
al., Nat. Biotechnol. 19, 242-247 (2001)) is one attempt
to meet this challenge by going after one extra dimension
of LC separation. This approach begins with the
digestion of the whole protein mixture and employs a
strong cation exchange (SCX) LC to separate protein
digests by a stepped gradient of salt concentrations.
This separation usually takes 10-20 steps to turn an
extremely complex protein mixture into a relatively
simplified mixture. The mixtures eluted from the SCX
column are further introduced into a reverse phase LC and
subsequently analyzed by mass spectrometry. This method
has been demonstrated to identify a large number of
proteins from yeast and the microsome of human myeloid
leukemia cells.

One of the obvious advantages of this technique is that 
it avoids insolubility problems in 2DE, as all the 
proteins are digested into peptide fragments which are 
usually much more soluble than proteins. As a result, 
more proteins can be detected and wider dynamic range 
achieved with LC/LC/MS/MS. Another advantage is that 
chromatographic resolution increases tremendously through 
the extensive 2D LC separation so that more high quality 
MS/MS spectra of peptides can be generated for more 
complete and reliable protein identification. The third 
advantage is that this approach is readily automated
within the framework of current LC/MS systems for
potentially high throughput proteomic analysis.

The extensive 2D LC separation in LC/LC/MS/MS, however,
could take 1-2 days to complete. In addition, this
technique alone is not able to provide quantitative
information of the proteins identified and a quantitative
scheme such as ICAT would require extra time and effort
with sample loss and extra complications. In spite of
the extensive 2D LC separation, there are still a
significant number of peptide ions not selected for MS/MS
experiments due to the time constraint between the MS/MS
data acquisition and the continuous LC elution, resulting
in low sequence coverage (25% coverage is considered as
very good already). While a recent development in
depositing LC traces onto a solid support for later MS/MS
analysis can potentially address the limited MS/MS
coverage issue, it would introduce significantly more
sample handling and protein loss and further complicate
the sample tracking and information management tasks.

Matrix-Assisted Laser Desorption Ionization (MALDI)
utilizes a focused laser beam to irradiate the target
sample that is co-crystallized with a matrix compound on a
conductive sample plate. The ionized molecules are
usually detected by a time of flight (TOF) mass
spectrometer, due to their shared characteristics as
pulsed techniques.

MALDI/TOF is commonly used to detect 2DE separated intact
proteins because of its excellent speed, high
sensitivity, wide mass range, high resolution, and
tolerance of contaminants. MALDI/TOF with capabilities
of delayed extraction and reflecting ion optics can achieve
impressive mass accuracy at 1-10 ppm and mass resolution
with m/Δm at 10000-15000 for the accurate analysis of
peptides. However, the lack of MS/MS capability in
MALDI/TOF is one of the major limitations for its use in
proteomics applications. Post Source Decay (PSD) in
MALDI/TOF does generate sequence-like MS/MS information
for peptides, but the operation of PSD often is not as
robust as that of a triple quadrupole or an ion trap mass
spectrometer. Furthermore, PSD data acquisition is
difficult to automate as it can be peptide-dependent.
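
To make the mass accuracy and resolution figures quoted above concrete, the short Python sketch below converts them into m/z units. The function names and the example ion at m/z 1500 are hypothetical and chosen only for illustration; the conversions themselves follow directly from the definitions of ppm and of m/Δm.

```python
def ppm_to_mz_error(mz: float, ppm: float) -> float:
    """Absolute m/z error corresponding to a relative accuracy in ppm."""
    return mz * ppm * 1e-6


def peak_width_from_resolution(mz: float, resolving_power: float) -> float:
    """Peak width (delta m) implied by a resolution m / delta m."""
    return mz / resolving_power


mz = 1500.0  # hypothetical peptide ion, for illustration only
print(ppm_to_mz_error(mz, 10.0))               # error at 10 ppm
print(peak_width_from_resolution(mz, 10000.0))  # width at m/dm = 10000
```

For this hypothetical ion, 10 ppm corresponds to about 0.015 m/z units and a resolution of 10000 to a peak width of about 0.15 m/z units, so the quoted accuracy is roughly ten times finer than the peak width.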

The newly developed MALDI TOF/TOF system (Rejtar, T. et
al., J. Proteome Res. 1(2) 171-179 (2002)) delivers many
attractive features. The system consists of two TOFs and
a collision cell, which is similar to the configuration
of a tandem quadrupole system. The first TOF is used to
select precursor ions that undergo collisional induced
dissociation (CID) in the cell to generate fragment ions.
Subsequently, the fragment ions are detected by the
second TOF. One of the attractive features is that
TOF/TOF is able to perform as many data dependent MS/MS
experiments as necessary, while a typical LC/MS/MS system
selects only a few abundant ions for the experiments.
This unique development makes it possible for TOF/TOF to
perform industry scale proteomic analysis. The proposed
solution is to collect fractions from 2D LC experiments
and spot the fractions onto a MALDI plate for MS/MS. As
a result, more MS/MS spectra can be acquired for more
reliable protein identification by database search as the
quality of MS/MS spectra generated by high-energy CID in
TOF/TOF is far better than PSD spectra.

The major drawback for this approach is the high cost of
the instrument ($750,000), the lengthy 2D separations,
the sample handling complexities with LC fractions, the
cumbersome sample preparation processes for MALDI, the
intrinsic difficulty in quantification with MALDI, and
the huge informatics challenges for data and sample
tracking. Due to the LC separation and the sample
preparation time required, the analysis of several
hundred proteins in one sample would take at least
[illegible] days.

It is well recognized that Fourier-Transform
Ion-Cyclotron Resonance (FTICR) MS is a powerful technique
that can deliver high sensitivity, high mass resolution,
wide mass range, and high mass accuracy. Recently,
FTICR/MS coupled with LC showed impressive capabilities
for proteomic analysis through Accurate Mass Tags (AMT)
(Smith, R. D. et al., Proteomics, 2, 513-523 (2002)). An
AMT is an m/z value of a peptide measured accurately
enough that it can be used to exclusively identify a
protein. It has been
demonstrated that, using the AMT approach, a single
LC/FTICR-MS analysis can potentially identify more than
10^5 proteins with mass accuracy of better than 1 ppm.
Nonetheless, AMT alone may not be sufficient to pinpoint
amino acid residue specific post-translational
modifications of peptides. In addition, the instrument
is prohibitively expensive at a cost of $750K or more
with high maintenance requirements.

Protein arrays and protein chips are emerging
technologies (Issaq, H. J. et al., Biochem. Biophys. Res.
Commun. 292(3), 587-592 (2002)) similar in the design
concept to the oligonucleotide chip used in gene
expression profiling. Protein arrays consist of protein
chips which contain chemically (cationic, anionic,
hydrophobic, hydrophilic, etc.) or biochemically
(antibody, receptor, DNA, etc.) treated surfaces for
specific interaction with the proteins of interest.
These technologies take advantage of the specificity
provided by affinity chemistry and the high sensitivity
of MALDI/TOF and offer high throughput detection of
proteins. In a typical protein array experiment, a large
number of protein samples can be simultaneously applied
to an array of chips treated with specific surface
chemistries. By washing away undesired chemical and
biomolecular background, the proteins of interest are
docked on the chips due to affinity capturing and hence
"purified". Further analysis of individual chips by
MALDI-TOF results in the protein profiles in the samples.
These technologies are ideal for the investigation of
protein-protein interactions, since proteins can be used
as affinity reagents to treat the surface to monitor
their interaction with other specific proteins. Another
useful application of these technologies is to generate
comparative patterns between normal and diseased tissue
samples as a potential tool for disease diagnostics.

Due to the complicated surface chemistries involved and
the added complications with proteins or other
protein-like binding agents such as denaturing, folding, and
solubility issues, protein arrays and chips are not
expected to have as wide an application as gene chips or
gene expression arrays.

Thus, the past 100 years have witnessed tremendous
strides made in MS instrumentation with many
different types of instruments designed and built for
high throughput, high resolution, and high sensitivity
work. The instrumentation has been developed to a stage
where single ion detection can be routinely accomplished
on most commercial MS systems with unit mass resolution
allowing for the observation of ion fragments coming from
different isotopes. In stark contrast to the
sophistication in hardware, very little has been done to
systematically and effectively analyze the massive amount
of MS data generated by modern MS instrumentation.

In a typical mass spectrometer, the user is usually
required to supply, or is supplied with, a standard
material having several fragment ions covering the mass
spectral m/z range of interest. Subject to baseline
effects, isotope interferences, mass resolution, and
resolution dependence on mass, peak positions of a few
ion fragments are determined either in terms of centroids
or peak maxima through a low order polynomial fit at the
peak top. These peak positions are then fit to the known
peak positions for these ions through either a 1st or
higher order polynomial fit to calibrate the mass (m/z)
axis.
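
The conventional calibration procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular instrument: the peak top is located with a parabola through the highest point and its two neighbors on an evenly spaced m/z grid, the calibration is a first order least-squares fit, and the function names and the standard's numerical values are hypothetical.

```python
from typing import Sequence, Tuple


def peak_top(mz: Sequence[float], y: Sequence[float]) -> float:
    """Locate a peak maximum from a parabola through the highest
    interior point and its two neighbors (evenly spaced m/z assumed)."""
    i = max(range(1, len(y) - 1), key=lambda k: y[k])
    step = mz[i + 1] - mz[i]
    curvature = y[i - 1] - 2.0 * y[i] + y[i + 1]
    if curvature >= 0:
        raise ValueError("no local maximum found")
    return mz[i] + 0.5 * step * (y[i - 1] - y[i + 1]) / curvature


def fit_first_order(measured: Sequence[float],
                    known: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line: known = slope * measured + intercept."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_k = sum(known) / n
    sxx = sum((m - mean_m) ** 2 for m in measured)
    sxk = sum((m - mean_m) * (k - mean_k)
              for m, k in zip(measured, known))
    slope = sxk / sxx
    return slope, mean_k - slope * mean_m


# Hypothetical standard: known m/z values of three ions and the
# positions at which an uncalibrated instrument reports them.
known = [69.00, 219.00, 502.00]
measured = [69.12, 219.20, 502.35]
slope, intercept = fit_first_order(measured, known)
residuals = [slope * m + intercept - k for m, k in zip(measured, known)]
print(slope, intercept, residuals)
```

Because only a handful of peak positions enter the fit, any bias in locating each peak top carries straight into the calibrated mass axis.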

After the mass axis calibration, a typical mass spectral
data trace would then be subjected to peak analysis where
peaks (ions) are identified. This peak detection routine
is a highly empirical and compounded process where peak
shoulders, noise in data trace, baselines due to chemical
backgrounds or contamination, isotope peak interferences,
etc., are considered.

For the peaks identified, a process called centroiding is
typically applied to attempt to calculate the integrated
peak areas and peak positions. Due to the many
interfering factors outlined above and the intrinsic
difficulties in determining peak areas in the presence of
other peaks and/or baselines, this is a process plagued
by many adjustable parameters that can make an isotope
peak appear or disappear with no objective measures of
the centroiding quality.
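
The sensitivity of centroiding to its adjustable parameters can be illustrated with a minimal Python sketch. It assumes one simple form of centroiding (an intensity-weighted mean position and a trapezoidal area after subtracting a constant baseline); the function name `centroid`, the baseline values, and the data points are hypothetical.

```python
from typing import Sequence, Tuple


def centroid(mz: Sequence[float], y: Sequence[float],
             baseline: float = 0.0) -> Tuple[float, float]:
    """Intensity-weighted mean position and trapezoidal area of one
    peak window after subtracting a constant baseline."""
    h = [max(v - baseline, 0.0) for v in y]
    total = sum(h)
    if total == 0:
        raise ValueError("no signal above baseline")
    position = sum(m * v for m, v in zip(mz, h)) / total
    area = sum(0.5 * (h[i] + h[i + 1]) * (mz[i + 1] - mz[i])
               for i in range(len(h) - 1))
    return position, area


# Made-up peak window whose right edge rises into a neighboring peak.
mz = [499.8, 499.9, 500.0, 500.1, 500.2, 500.3]
y = [10.0, 60.0, 100.0, 60.0, 30.0, 40.0]

pos_a, area_a = centroid(mz, y, baseline=0.0)
pos_b, area_b = centroid(mz, y, baseline=30.0)
print(pos_a, area_a)
print(pos_b, area_b)
```

With the same raw data, changing only the assumed baseline shifts the reported position and roughly halves the reported area, which is the kind of parameter dependence the text describes.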

Thus, despite their apparent sophistication, current
approaches have several pronounced disadvantages. These
include:

Lack of Mass Accuracy. The mass calibration currently in
use usually does not provide better than 0.1 amu (m/z
unit) in mass determination accuracy on a conventional MS
system with unit mass resolution (ability to visualize
the presence or absence of a significant isotope peak).
In order to achieve higher mass accuracy and reduce
ambiguity in molecular fingerprinting such as peptide
mapping for protein identification, one has to switch to
an MS system with higher resolution such as quadrupole
TOF (qTOF) or FT ICR MS which come at significantly
higher cost.

Large Peak Integration Error. Due to the contribution of
mass spectral peak shape, its variability, the isotope
peaks, the baseline and other background signals, and the
random noise, current peak area integration has large
errors (both systematic and random errors) for either
strong or weak mass spectral peaks.

Difficulties with Isotope Peaks. The current approach does
not have a good way to separate the contributions from
various isotopes which usually give partially
overlapped mass spectral peaks on conventional MS systems
with unit mass resolution. The empirical approaches used
either ignore the contributions from neighboring isotope
peaks or over-estimate them, resulting in errors for
dominating isotope peaks and large biases for weak
isotope peaks or even complete ignorance of the weaker
peaks. When ions of multiple charges are concerned, the
situation becomes even worse, due to the now reduced
separation in mass unit between neighboring isotope
peaks.

Nonlinear Operation. The current approaches use a
multi-stage disjointed process with many empirically adjustable
parameters during each stage. Systematic errors (biases)
are generated at each stage and propagated down to the
later stages in an uncontrolled, unpredictable, and
nonlinear manner, making it impossible for the algorithms
to report meaningful statistics as measures of data
processing quality and reliability.

Dominating Systematic Errors. In most MS applications,
ranging from industrial process control and environmental
monitoring to protein identification or biomarker
discovery, instrument sensitivity or detection limit has
always been a focus and great efforts have been made in
many instrument systems to minimize measurement error or
noise contribution in the signal. Unfortunately, the
peak processing approaches currently in use create a
source of systematic error even larger than the random
noise in the raw data, thus becoming the limiting factor
in instrument sensitivity or reliability.

Mathematical and Statistical Inconsistency. The many
empirical approaches used currently make the whole mass
spectral peak processing inconsistent either
mathematically or statistically. The peak processing
results can change dramatically on slightly different
data without any random noise or on the same synthetic
data with slightly different noise. In other words, the
results of the peak processing are not robust and can be
unstable depending on the particular experiment or data
collection.

Instrument-To-Instrument Variations. It has usually been
difficult to directly compare raw mass spectral data from
different MS instruments due to variations in the
mechanical, electromagnetic, or environmental tolerances.
The current ad hoc peak processing applied on the
raw data only adds to the difficulty of
quantitatively comparing results from different MS
instruments. On the other hand, the need for comparing
either raw mass spectral data directly or peak processing
results from different instruments or different types of
instruments has been increasingly heightened for the
purpose of impurity detection or protein identification
through the searches in established MS libraries.
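
The remark under "Difficulties with Isotope Peaks" that multiple charges reduce the separation between neighboring isotope peaks can be made concrete with a short Python sketch. It assumes, as a simplification, that the spacing is set by the mass difference between 13C and 12C (about 1.003355 Da); the function name is hypothetical.

```python
C13_C12_MASS_DIFF = 1.003355  # Da; assumed to dominate isotope spacing


def isotope_spacing(charge: int) -> float:
    """m/z separation between neighboring isotope peaks of an ion."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return C13_C12_MASS_DIFF / charge


for z in (1, 2, 3, 4):
    print(z, round(isotope_spacing(z), 4))
```

On a system with unit mass resolution, a doubly charged ion already has isotope peaks only about half an m/z unit apart, so the peaks overlap more heavily than for singly charged ions.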

A second order instrument generates a matrix of data for
each sample and can have a higher analytical power than
first order instruments if the data matrix is properly
structured. The most widely used proteomics instrument,
LC/MS, is a typical example of a second order instrument
capable of potentially much higher analytical power than
what is currently achieved. Other second order
proteomics instruments include LC/LC with single UV
wavelength detection, 1D gel with MALDI-TOF MS detection,
1D protein arrays with MALDI MS detection, etc.
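
What "properly structured" means for a second order data matrix can be sketched in Python. The sketch assumes the usual bilinear model, in which each component contributes the outer product of its elution profile (over time) and its mass spectrum (over m/z); the function names and all numerical profiles are made up for illustration.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def outer(profile: Sequence[float], spectrum: Sequence[float]) -> Matrix:
    """Rows follow elution time, columns follow m/z."""
    return [[c * s for s in spectrum] for c in profile]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def has_rank_at_most_one(d: Matrix, tol: float = 1e-9) -> bool:
    """True if every 2x2 minor vanishes, i.e. every scan (row) is a
    scaled copy of one common spectrum."""
    rows, cols = len(d), len(d[0])
    return all(
        abs(d[i][j] * d[k][l] - d[i][l] * d[k][j]) <= tol
        for i in range(rows) for k in range(i + 1, rows)
        for j in range(cols) for l in range(j + 1, cols)
    )


# Made-up elution profiles and spectra for two co-eluting components.
c1, s1 = [0.0, 0.5, 1.0, 0.5, 0.0], [1.0, 0.6, 0.2, 0.0]
c2, s2 = [0.0, 0.0, 0.4, 0.8, 0.4], [0.0, 0.3, 1.0, 0.5]

d_single = outer(c1, s1)
d_mixture = add(d_single, outer(c2, s2))
print(has_rank_at_most_one(d_single))
print(has_rank_at_most_one(d_mixture))
```

A single component gives a matrix in which all scans share one spectral shape; a second co-eluting component breaks that proportionality, and it is this rank structure that second order analysis relies on to resolve overlapping components.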

Two-dimensional gel electrophoresis (2D gel) has been
widely used in the separation of proteins in complex
biological samples such as cells or urine. Typically
the spots formed by the proteins are stained with silver
for easy identification with visible imaging systems.
These spots are subsequently excised, dissolved/digested
with enzymes, transported onto MALDI targets, dried, and
analyzed for peptide signatures using a MALDI
time-of-flight mass spectrometer.

Several complications arise from this process: 

1. The protein spots are not guaranteed to contain only
single proteins, especially at extreme ends of the
separation parameters (pI for charge or MW for molecular
weight). This usually makes peptide searching difficult
if not impossible. Additional liquid chromatography
separation may be required for each excised spot, which
further slows down the analysis.
2. The conversion of biological sample from liquid phase
to solid phase (on the gel), back into liquid phase (for
digestion), and finally into solid phase again (for MALDI
TOF analysis) is a very cumbersome process prone to
errors, carry-overs, and contaminations.
3. Due to the sample conversion processes involved and the
irreproducibility of MALDI-TOF in sampling and
ionization, this analysis has been widely recognized as
only qualitative and not quantitative.

Thus, in spite of its tremendous potential and clear
advantages over first and zeroth order analysis, second
order instrument and analysis have so far been limited to
academic research where the sample is composed of a few
synthetic analytes with no sign of commercialization.
There are several barriers that must be crossed in order
for this approach to reach its huge potential. These
include:

a. In second order protein analysis, it is even more
important to use raw profile MS scans instead of the
centroid data currently used in virtually all MS
applications. To maintain the bilinear data structure,
successive MS scans of a particular ion eluting from LC
need to have the same mass spectral peak shape
(obviously at different peak heights), a critical second
order structure destroyed by centroiding and de-isotoping
(summing all isotope peaks into one integrated area).

WO 2004/097582
PCT/US2004/013097

The sticks from centroided data appear at different mass
locations (up to 0.5 amu error) in successive MS scans
of the same ion.

b. Higher order instruments and analysis require a more
robust instrument and measurement process, and artifacts
such as shifts in one or two of the dimensions can
severely compromise the quantitative and even the
qualitative results of the analysis (Wang, Y. et al,
Anal. Chem. 63, 2750 (1991); Wang, Y. et al, Anal. Chem.
65, 1174 (1993); Kiers, H. A. L. et al, J. Chemometrics
13, 275 (1999)), in spite of the recent progress made in
academia (Bro, R. et al, J. Chemometrics 13, 295 (1999)).
Other artifacts such as non-linearity or non-bilinearity
could also lead to complications (Wang, Y. et al, J.
Chemometrics 7, 439 (1993)). Standardization and
algorithmic corrections need to be developed in order to
maintain the bilinearity of second order proteomics data.

c. In many MS instruments such as quadrupole MS, the
mass spectral scan time is not negligible compared to the
protein or peptide elution time. Therefore, a
significant skew would exist where the ions measured in
one mass spectral scan come from different time points
during the LC elution, similar to what has been reported
for GC/MS (Stein, S. E. et al, J. Am. Soc. Mass Spectrom.
5, 859 (1994)).
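
The time skew described in item c can be made concrete with a short sketch. It assumes, purely for illustration, that each scan visits its mass channels one after another at a uniform spacing `dwell`, and that linear interpolation along the time axis is an acceptable way to refer every channel back to the start of its scan. The function name `deskew`, the acquisition model, and the interpolation scheme are assumptions and are not taken from this document.

```python
from typing import List


def deskew(scans: List[List[float]], scan_starts: List[float],
           dwell: float) -> List[List[float]]:
    """Estimate every channel at the start time of its own scan.

    Assumed acquisition model (hypothetical): channel j of scan i is
    measured at scan_starts[i] + j * dwell, and j * dwell never exceeds
    the spacing between consecutive scan starts.
    """
    corrected = [row[:] for row in scans]
    for i in range(1, len(scans)):
        for j in range(len(scans[i])):
            offset = j * dwell
            t_prev = scan_starts[i - 1] + offset  # when scans[i-1][j] was read
            t_curr = scan_starts[i] + offset      # when scans[i][j] was read
            # Linear interpolation back to the nominal time scan_starts[i].
            frac = (scan_starts[i] - t_prev) / (t_curr - t_prev)
            corrected[i][j] = (scans[i - 1][j]
                               + frac * (scans[i][j] - scans[i - 1][j]))
    # The first scan has no earlier neighbour and is returned unchanged.
    return corrected


def ramp(t: float) -> float:
    """Toy elution signal that rises linearly with time."""
    return 2.0 * t + 1.0


starts = [0.0, 1.0, 2.0, 3.0]
# Three channels per scan, each read 0.25 time units after the previous one.
skewed = [[ramp(t + j * 0.25) for j in range(3)] for t in starts]
fixed = deskew(skewed, starts, 0.25)
```

For a signal that varies linearly between scans, the interpolated values coincide with the signal at the nominal scan times; for real elution peaks the result is only an approximation whose quality depends on how many scans cover each peak.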

Thus, there exists a significant gap between where
proteomics research would like to be and where it is at
present.

SUMMARY OF THE INVENTION



It is an object of the invention to provide a chemical 
analysis system, which may include a mass spectrometer, 
and a method for operating a chemical analysis system 
that overcomes the disadvantages described above.

It is another object of the invention to provide a
storage medium having thereon computer readable program
code for causing a chemical analysis system, including a
chemical analysis system having a mass spectrometer,
to perform the method in accordance with the
invention.

These objects and others are achieved in accordance with
a first aspect of the invention by using 2D gel imaging
data acquired from intact proteins to perform both
qualitative and quantitative analysis without the use of
a mass spectrometer in the presence of protein spot
overlaps. In addition, the invention facilitates direct
quantitative comparisons between many different samples
collected over a wider population range (diseased
and healthy), over a period of time on the same
population (development of disease), and over different
treatment methods (response to potential treatment), etc.
The gel spot alignment and matching are automatically
built into the data analysis to yield the best overall
results. The approach in accordance with the invention
represents a fast, inexpensive, quantitative, and
qualitative tool for both protein identification and
protein expression analysis.

Generally, the invention is directed to a method for
analyzing data obtained from at least one sample in a
separation system that has a capability for separating
components of a sample containing more than one component
as a function of at least two different variables, the
method comprising obtaining data representative of the at
least one sample from the system, the data being
expressed as a function of the two variables; forming a
data stack having successive levels, each level
containing successive data representative of the at least
one sample; forming a data array representative of a
compilation of all of the data in the data stack; and
separating the data array into a series of matrixes, the
matrixes being: a concentration matrix representative of
concentration of each component in the sample; a first
profile of the components as a function of a first of the
variables; and a second profile of the components as a
function of a second of the variables. There may be
only a single sample, in which case the successive data is
representative of the sample as a function of time.
Successive data may be representative of the single
sample as a function of mass of its components.
Alternatively, there may be a plurality of samples, and
the successive data is then representative of successive
samples.
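
The step of forming a data stack and compiling it into a data array can be sketched as follows. This is a minimal illustration that assumes each level is a rectangular m x n measurement stored as nested lists; the names `build_data_array` and `shape` are hypothetical and do not appear in this document.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def build_data_array(levels: Sequence[Matrix]) -> List[Matrix]:
    """Stack p two-dimensional measurements (each m x n) into one array.

    The result is indexed as array[level][i][j]; every level must have
    the same m x n shape so that the positions line up across levels.
    """
    if not levels:
        raise ValueError("at least one level is required")
    m, n = len(levels[0]), len(levels[0][0])
    for idx, level in enumerate(levels):
        if len(level) != m or any(len(row) != n for row in level):
            raise ValueError(f"level {idx} does not have shape {m} x {n}")
    # Copy the rows so later edits to the inputs do not alter the array.
    return [[list(row) for row in level] for level in levels]


def shape(array: List[Matrix]) -> Tuple[int, int, int]:
    """Return (m, n, p): first variable, second variable, number of levels."""
    return (len(array[0]), len(array[0][0]), len(array))


sample_a = [[0.0, 1.0, 0.0], [0.0, 2.0, 0.0]]
sample_b = [[0.0, 3.0, 0.0], [0.0, 6.0, 0.0]]
data_array = build_data_array([sample_a, sample_b])
```

Here the two levels stand for two samples measured on the same 2 x 3 grid; the same stacking applies when the levels are successive time points or mass channels of a single sample.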

The invention is more specifically directed to a method
for analyzing data obtained from multiple samples in a
separation system that has a capability for separating
components of a sample containing more than one component
as a function of two different variables, the method
comprising obtaining data representative of multiple
samples from the system, the data being expressed as a
function of the two variables; forming a data stack
having successive levels, each level containing one of
the data samples; forming a data array representative of a
compilation of all of the data in the data stack; and
separating the data array into a series of matrixes, the
matrixes being: a concentration matrix representative of
concentration of each component in the sample; a first
profile of the components as a function of the first
variable; and a second profile of the components as a
function of the second variable. The first profile and
the second profile are representative of profiles of
substantially pure components. The method further
comprises performing qualitative analysis using at least
one of the first profile and the second profile.
The method may further comprise standardizing data
representative of a sample by performing a data matrix
multiplication of such data into the product of a first
standardization matrix, the data itself, and a second
standardization matrix, to form a standardized data
matrix. Terms in the first standardization matrix and the
second standardization matrix may have values that cause
the data to be represented at positions with respect to
the two variables, which are different in the
standardized data matrix from those in the data array.
The first standardization matrix shifts the data with
respect to the first variable, and the second
standardization matrix shifts the data with respect to
the second variable. Terms in the first standardization
matrix and the second standardization matrix have values
that serve to standardize distribution shapes of the data
with respect to the first and second variable,
respectively. Terms in the first standardization matrix
and the second standardization matrix may be determined
by applying a sample having known components to the
apparatus; and selecting terms for the first
standardization matrix and the second standardization
matrix which cause data produced by the known components
to be positioned properly with respect to the first
variable and the second variable. The terms may be
determined by selecting terms which produce a smallest
error in position of the data with respect to the first
variable and the second variable in the standardized data
matrix. The terms of the first standardization matrix
and the second standardization matrix are preferably
computed for each sample, and so as to produce a smallest
error over all samples. At least one of the first and
second standardization matrices can be simplified to be
either a diagonal matrix or an identity matrix. The terms
in the first standardization matrix and the second
standardization matrix may be based on parameterized
known functional dependence of the terms on the
variables.
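
The effect of multiplying the data between a first and a second standardization matrix can be illustrated with the simplest possible case. The sketch below assumes standardization matrices that contain only zeros and ones and therefore perform whole-position shifts; matrices that also reshape distributions would carry fractional weights in a band around the shifted diagonal. The helper names `row_shift`, `col_shift`, and `standardize` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (r x s) and b (s x t)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def row_shift(size: int, s: int) -> Matrix:
    """Left factor: (A R)[i][j] equals R[i - s][j] where that row exists."""
    return [[1.0 if i - j == s else 0.0 for j in range(size)]
            for i in range(size)]


def col_shift(size: int, s: int) -> Matrix:
    """Right factor: (R B)[i][j] equals R[i][j - s] where that column exists."""
    return [[1.0 if j - i == s else 0.0 for j in range(size)]
            for i in range(size)]


def standardize(a: Matrix, r: Matrix, b: Matrix) -> Matrix:
    """Form the product (first matrix) x (data) x (second matrix)."""
    return matmul(matmul(a, r), b)


# A 3 x 4 measurement with a single spot at row 0, column 1.
raw = [[0.0, 5.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, 0.0]]
aligned = standardize(row_shift(3, 1), raw, col_shift(4, 2))
unchanged = standardize(row_shift(3, 0), raw, col_shift(4, 0))
```

The left factor moves the spot along the first variable and the right factor moves it along the second, so the spot ends up at row 1, column 3. A shift of zero gives identity matrices, which corresponds to the simplification to an identity matrix mentioned above.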

Values of terms in the first standardization matrix and
the second standardization matrix are determined by
solving the data array R:

where Q (m x k) contains pure profiles of all k components
with respect to the first variable, W (n x k) contains
pure profiles with respect to the second variable for the
components, C (p x k) contains concentrations of these
components in all p samples, I is a new data array with
scalars on its super-diagonal as the only nonzero
elements, and E (m x n x p) is a residual data array.
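
The equation for the data array R is not legible in this text. Assuming the standard trilinear model, which matches the dimensions given for Q, W, C, I and E, one consistent element-by-element form would be the following, where the symbol \lambda_f for the f-th super-diagonal scalar of I is introduced here only for illustration:

```latex
r_{ijl} \;=\; \sum_{f=1}^{k} \lambda_f \, q_{if} \, w_{jf} \, c_{lf} \;+\; e_{ijl},
\qquad i = 1,\dots,m,\quad j = 1,\dots,n,\quad l = 1,\dots,p
```

Under this reading, each component f contributes the product of its profile along the first variable, its profile along the second variable, and its concentration in sample l, and E collects whatever the k components do not explain.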

The separation apparatus may be a two-dimensional
electrophoresis separation system, wherein the first
variable is isoelectric point and the second variable is
molecular weight.

The variables may be a result of any combination, in no
particular sequence, and including self-combination, of
chromatographic separation, capillary electrophoresis
separation, gel-based separation, affinity separation and
antibody separation.

One of the two variables may be mass associated with the
mass axis of a mass spectrometer.

The apparatus may further comprise a chromatography
system for providing the samples to the mass
spectrometer, retention time being another of the two
variables.

The apparatus may further comprise an electrophoresis
separation system for providing the samples to the mass
spectrometer, migration characteristics of the sample
being another of the two variables.

In the method, the data is preferably continuum mass
spectral data. Preferably, the data is used without
centroiding. The data may be corrected for time skew.
Preferably, a calibration of the data with respect to
mass and mass spectral peak shapes is performed.

One of the first variable and the second variable may be
that of a region on a protein chip having a plurality of
protein affinity regions.

The method may further comprise obtaining data for the
data array by using a single channel analyzer and by
analyzing the samples successively. The single channel
detector may be based on one of light absorption, light
emission, light reflection, light transmission, light
scattering, refractive index, electrochemistry,
conductivity, radioactivity, or any combination thereof.
The components in the sample may be bound to at least one
of fluorescence tags, isotope tags, stains, affinity
tags, or antibody tags.

The invention is also directed to a computer readable 
medium having thereon computer readable code for use with 
a chemical analysis system having a data analysis portion 
for analyzing data obtained from multiple samples, the 
chemical analysis system having a separation portion that 
has a capability for separating components of a sample 
containing more than one component as a function of two 
different variables, the computer readable code being for 
causing the computer to perform a method comprising 
obtaining data representative of multiple samples from 
the system, the data being expressed as a function of the 
two variables; forming a data stack having successive 
levels, each level containing one of the data samples; 
forming a data array representative of a compilation of
all of the data in the data stack; and separating the
data array into a series of matrixes, the matrixes being: 
a concentration matrix representative of concentration of 
each component in the sample; a first profile of the 
components as a function of the first variable; and a 
second profile of the components as a function of the 
second variable. The computer readable medium may 
further comprise computer readable code for causing the 
computer to analyze data by performing the steps of any 
one of the methods stated above. 

The invention is further directed to a chemical analysis 
system for analyzing data obtained from multiple samples, 
the system having a separation system that has a 
capability for separating components of a sample 
containing more than one component as a function of two 
different variables, the system having apparatus for 
performing a method comprising obtaining data 
representative of multiple samples from the system, the 
data being expressed as a function of the two variables; 
forming a data stack having successive levels, each level 
containing one of the data samples; forming a data array 
representative of a compilation of all of the data in the 
data stack; and separating the data array into a series 
of matrixes, the matrixes being: a concentration matrix 
representative of concentration of each component in the 
sample; a first profile of the components as a function 
of the first variable; and a second profile of the 
components as a function of the second variable. The 
chemical analysis system may have facilities for 
performing the steps of any of the methods described 
above.

The invention further includes a method for analyzing
data obtained from a sample in a separation system that 
has a capability for separating components of a sample 
containing more than one component, the method comprising 
separating the sample with respect to at least a first 
variable to form a separated sample; separating the 
separated sample with respect to at least a second 
variable to form a further separated sample; obtaining 
data representative of the further separated sample from 
a multi-channel analyzer, the data being expressed as a 
function of three variables; forming a data stack having 
successive levels, each level containing data from one 
channel of the multi-channel analyzer; forming a data 
array representative of a compilation of all of the data 
in the data stack; and separating the data array into a 
series of matrixes or arrays, the matrixes or arrays 
being: a concentration data array representative of 
concentration of each component in the sample on its 
super-diagonal; a first profile of each component as a 
function of a first variable; a second profile of each 
component as a function of a second variable; and a third 
profile of each component as a function of a third 
variable. The first profile, the second profile, and the 
third profile are representative of profiles of 
substantially pure components. The method further 
comprises performing qualitative analysis using at least 
one of the first profile, the second profile, and the 
third profile. 

The method further comprises standardizing data
representative of a sample by performing a data matrix
multiplication of such data into the product of a first
standardization matrix, the data itself, and a second
standardization matrix, to form a standardized data
matrix. Terms in the first standardization matrix and the
second standardization matrix have values that cause the
data to be represented at positions with respect to two
of the three variables, which are different in the
standardized data matrix from those in the data array.
The first standardization matrix shifts the data with
respect to one of the two variables, and the second
standardization matrix shifts the data with respect to
the other of the two variables. Terms in the first
standardization matrix and the second standardization
matrix may have values that serve to standardize
distribution shapes of the data with respect to the
two variables, respectively. Terms in the first
standardization matrix and the second standardization
matrix are determined by applying a sample having known
components to the apparatus; and selecting terms for the
first standardization matrix and the second
standardization matrix which cause data produced by the
known components to be positioned properly with respect
to the two variables.

The terms are determined by selecting terms that produce
a smallest error in position of the data with respect to
the two variables, in the standardized data matrix. The
terms of the first standardization matrix and the second
standardization matrix may be computed for a single
channel. The terms of the first standardization matrix
and the second standardization matrix are computed so as
to produce a smallest error for the channel.

At least one of the first and second standardization
matrices can be simplified to be either a diagonal matrix
or an identity matrix. Preferably, the terms in the
first standardization matrix and the second
standardization matrix are based on parameterized known
functional dependence of the terms on the variables.
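
Selecting terms that produce a smallest position error, with the terms tied to a parameterized functional form, can be sketched for one axis as follows. The sketch assumes, for illustration only, that the true position of a known component depends linearly on its observed position (true = a * observed + b) and that the error is measured as a sum of squared position differences. The function names and the linear form are assumptions, not the method prescribed by this document.

```python
from typing import Sequence, Tuple


def fit_axis_map(observed: Sequence[float],
                 expected: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of expected = a * observed + b for one axis.

    observed: positions at which known standard components appeared.
    expected: positions at which those components should appear.
    """
    n = len(observed)
    if n < 2 or n != len(expected):
        raise ValueError("need at least two matched positions")
    x_mean = sum(observed) / n
    y_mean = sum(expected) / n
    sxx = sum((x - x_mean) ** 2 for x in observed)
    if sxx == 0:
        raise ValueError("observed positions must not all coincide")
    sxy = sum((x - x_mean) * (y - y_mean)
              for x, y in zip(observed, expected))
    a = sxy / sxx
    b = y_mean - a * x_mean
    return a, b


def position_error(observed: Sequence[float], expected: Sequence[float],
                   a: float, b: float) -> float:
    """Sum of squared position errors left after applying the map."""
    return sum((a * x + b - y) ** 2 for x, y in zip(observed, expected))


# Three standards that all appeared two positions too early.
a, b = fit_axis_map([10.0, 20.0, 30.0], [12.0, 22.0, 32.0])
residual = position_error([10.0, 20.0, 30.0], [12.0, 22.0, 32.0], a, b)
```

The two fitted parameters would then be used to fill in the terms of the corresponding standardization matrix, so that only a and b need to be estimated rather than every matrix element.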

In accordance with the invention, the values of terms in
the first standardization matrix and in the second
standardization matrix are determined by solving data
array R:

where Q (m x k) contains pure profiles of all k components
with respect to the first variable, W (n x k) contains
pure profiles with respect to the second variable for the
components, C (p x k) contains pure profiles of these
components with respect to the multichannel detector or
the third variable, I (k x k x k) is a new data array
with scalars on its super-diagonal as the only nonzero
elements representing the concentrations of all the
components, and E (m x n x p) is a residual data array.
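
To make the roles of Q, W, C and the super-diagonal of I concrete, the sketch below builds a noise-free data array from given factors. It assumes the usual trilinear reading of the definitions above, in which element (i, j, l) of R is the sum over components f of the f-th super-diagonal scalar times Q[i][f], W[j][f] and C[l][f]. The function name `trilinear` and the argument `diag` (the super-diagonal of I given as a list) are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def trilinear(q: Matrix, w: Matrix, c: Matrix,
              diag: List[float]) -> List[List[List[float]]]:
    """Compose an m x n x p array from k components (residual E omitted).

    q: m x k profiles along the first variable.
    w: n x k profiles along the second variable.
    c: p x k profiles along the third variable (detector channels).
    diag: the k super-diagonal scalars of I (assumed concentrations).
    """
    m, n, p, k = len(q), len(w), len(c), len(diag)
    return [[[sum(diag[f] * q[i][f] * w[j][f] * c[l][f] for f in range(k))
              for l in range(p)]
             for j in range(n)]
            for i in range(m)]


# One component at concentration 2 with simple two-point profiles.
r_single = trilinear([[1.0], [2.0]], [[1.0], [3.0]], [[1.0], [10.0]], [2.0])

# Two components that do not overlap along the first variable.
r_pair = trilinear([[1.0, 0.0], [0.0, 1.0]],
                   [[1.0, 1.0], [2.0, 0.0]],
                   [[1.0, 4.0]],
                   [3.0, 5.0])
```

Decomposition runs this construction in reverse: given a measured R, it looks for the factors and super-diagonal scalars that reproduce it with the smallest residual E.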



The separation apparatus used may be a 1D
electrophoresis separation system, wherein the variable
is one of isoelectric point and molecular weight.

The two separation variables may be a result of any
combination, in no particular sequence, and including
self-combination, of chromatographic separation,
capillary electrophoresis separation, gel-based
separation, affinity separation and antibody separation.

One of the three variables may be mass associated with
the mass axis of a mass spectrometer.

The apparatus used may comprise at least one
chromatography system for providing the separated samples
to the mass spectrometer, retention time being at least
one of the variables. The apparatus may also comprise at
least one electrophoresis separation system for providing
the separated samples to the mass spectrometer, migration
characteristics of the sample being at least one of the
variables. Preferably, the data is continuum mass
spectral data. Preferably, the data is used without
centroiding.

The method may further comprise correcting the data for
time skew. The method also may further comprise
performing a calibration of the data with respect to mass
and spectral peak shapes.

The apparatus used may comprise a protein chip having a
plurality of protein affinity regions, location of a
region being one of the three variables.

The multi-channel analyzer used may be based on one of
light absorption, light emission, light reflection, light
transmission, light scattering, refractive index,
electrochemistry, conductivity, radioactivity, or any
combination thereof. The components in the sample may be
bound to at least one of fluorescence tags, isotope tags,
stains, affinity tags, or antibody tags.

The apparatus used may comprise a two-dimensional 
electrophoresis separation system, wherein a first of the 
at least one variable is isoelectric point and a second 
of the at least one variable is molecular weight. 

The invention is also directed to a computer readable
medium having thereon computer readable code for use with
a chemical analysis system having a data analysis portion
for analyzing data obtained from a sample, the chemical
analysis system having a separation portion that has a
capability for separating components of a sample
containing more than one component as a function of at
least one variable, the computer readable code being for
causing the computer to perform a method comprising
separating the sample with respect to at least a first
variable to form a separated sample; separating the
separated sample with respect to at least a second
variable to form a further separated sample; obtaining
data representative of the further separated sample from
a multi-channel analyzer, the data being expressed as a
function of three variables; forming a data stack having
successive levels, each level containing data from one
channel of the multi-channel analyzer; forming a data
array representative of a compilation of all of the data
in the data stack; and separating the data array into a
series of matrixes or arrays, the matrixes or arrays
being: a concentration data array representative of
concentration of each component in the sample on its
super-diagonal; a first profile of each component as a
function of a first variable; a second profile of each
component as a function of a second variable; and a third
profile of each component as a function of a third
variable. The computer readable medium may further
comprise computer readable code for causing the computer
to analyze data by performing the steps of any of the
methods set forth above.

The invention is also directed to a chemical analysis
system for analyzing data obtained from a sample, the
system having a separation system that has a capability
for separating components of a sample containing more
than one component as a function of at least one
variable, the system having apparatus for performing a
method comprising separating the sample with respect to
at least a first variable to form a separated sample;
separating the separated sample with respect to at least
a second variable to form a further separated sample;
obtaining data representative of the further separated
sample from a multi-channel analyzer, the data being
expressed as a function of three variables; forming a
data stack having successive levels, each level
containing data from one channel of the multi-channel
analyzer; forming a data array representative of a
compilation of all of the data in the data stack; and
separating the data array into a series of matrixes or
arrays, the matrixes or arrays being: a concentration
data array representative of concentration of each
component in the sample on its super-diagonal; a first
profile of each component as a function of a first
variable; a second profile of each component as a
function of a second variable; and a third profile of
each component as a function of a third variable. The
chemical analysis system may further comprise facilities
for performing the steps of the methods described above.

BRIEF DESCRIPTION OF THE DRAWINGS 

The foregoing aspects and other features of the present 
invention are explained in the following description, 
taken in connection with the accompanying drawings,
wherein like numerals indicate like components, and 
wherein: 

Fig. 1 is a block diagram of an analysis system in
accordance with the invention, including a mass
spectrometer.

Fig. 2 is a block diagram of a system having one
dimensional sample separation, and a multi-channel
detector.

Fig. 3 is a block diagram of a system having two
dimensional sample separation, and a single channel
detector.

Fig. 4A, Fig. 4B and Fig. 4C illustrate the compilation
of three-dimensional data arrays based on two-dimensional
measurements, in accordance with the invention.

Fig. 5 illustrates a three-dimensional data array based
on single three-dimensional measurements with one sample.

Fig. 6 illustrates a three-dimensional data array based
on two-dimensional liquid phase separation followed by
mass spectral detection.

Fig. 7 illustrates time skew correction for multi-channel
detection with sequential scanning.

Fig. 8 is a flow chart of a method of analysis in
accordance with the invention.

Fig. 9 illustrates a transformation for automatic
alignment of separation axes and corresponding profiles,
in accordance with the invention.

Fig. 10 illustrates direct decomposition of a three-
dimensional data array.

Fig. 11 illustrates grouping of peptides (a dendrogram)
resulting from enzymatic digestion into proteins through
cluster analysis, in accordance with the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT 

Referring to Fig. 1, there is shown a block diagram of an
analysis system 10, that may be used to analyze proteins
or other molecules, as noted above, incorporating
features of the present invention. Although the present
invention will be described with reference to the single
embodiment shown in the drawings, it should be understood
that the present invention can be embodied in many
alternate forms of embodiments. In addition, any
suitable types of components could be used.

Analysis system 10 has a sample preparation portion 12, a
mass spectrometer portion 14, a data analysis system 16,
and a computer system 18. The sample preparation portion
12 may include a sample introduction unit 20, of the type
that introduces a sample containing molecules of interest
to system 10, such as the Finnigan LCQ Deca XP Max,
manufactured by Thermo Electron Corporation of Waltham,
MA, USA. The sample preparation portion 12 may also
include an analyte separation unit 22, which is used to
perform a preliminary separation of analytes, such as the
proteins to be analyzed by system 10. Analyte separation
unit 22 may be any one of a chromatography column or a gel
separation unit, such as is manufactured by Bio-Rad
Laboratories, Inc. of Hercules, CA, as is well known in
the art. In general, a voltage or pH gradient is applied
to the gel to cause the molecules such as proteins to be
separated as a function of one variable, such as
migration speed through a capillary tube (molecular
weight, MW) or isoelectric focusing point (Hanash, S.
M., Electrophoresis 21, 1202-1209 (2000)) for one
dimensional separation, or by more than one of these
variables, such as by isoelectric focusing and by MW (two
dimensional separation). An example of the latter is
known as SDS-PAGE.

The mass spectrometer portion 14 may be a conventional mass
spectrometer and may be any one available, but is
preferably one of MALDI-TOF, quadrupole MS, ion trap MS,
or FTICR-MS. If it has a MALDI or electrospray
ionization ion source, such ion source may also provide
for sample input to the mass spectrometer portion 14. In
general, mass spectrometer portion 14 may include an ion
source 24, a mass spectrum analyzer 26 for separating
ions generated by ion source 24 by mass to charge ratio
(or simply called mass), an ion detector portion 28 for
detecting the ions from mass spectrum analyzer 26, and a
vacuum system 30 for maintaining a sufficient vacuum for
mass spectrometer portion 14 to operate efficiently. If
mass spectrometer portion 14 is an ion mobility
spectrometer, generally no vacuum system is needed.

The data analysis system 16 includes a data acquisition
portion 32, which may include one or a series of analog
to digital converters (not shown) for converting signals
from ion detector portion 28 into digital data. This
digital data is provided to a real time data processing
portion 34, which processes the digital data through
operations such as summing and/or averaging. A post
processing portion 36 may be used to do additional
processing of the data from real time data processing
portion 34, including library searches, data storage and
data reporting.

Computer system 18 provides control of sample preparation
portion 12, mass spectrometer portion 14, and data
analysis system 16, in the manner described below.
Computer system 18 may have a conventional computer
monitor 40 to allow for the entry of data on appropriate
screen displays, and for the display of the results of
the analyses performed. Computer system 18 may be based
on any appropriate personal computer, operating for
example with a Windows® or UNIX® operating system, or any
other appropriate operating system. Computer system 18
will typically have a hard drive 42, on which the
operating system and the program for performing the data
analysis described below are stored. A drive 44 for
accepting a CD or floppy disk is used to load the program
in accordance with the invention on to computer system
18. The program for controlling sample preparation
portion 12 and mass spectrometer portion 14 will
typically be downloaded as firmware for these portions of
system 10. Data analysis system 16 may be a program
written to implement the processing steps discussed
below, in any of several programming languages such as
C++, JAVA or Visual Basic.

Fig. 2 is a block diagram of an analysis system 50
wherein the sample preparation portion 12 includes a 
sample introduction unit 20 and a one dimensional sample 
separation apparatus 52. By way of example, apparatus 52 
may be a one dimensional electrophoresis apparatus. 
Separated sample components are analyzed by a multi- 
channel detection apparatus 54, such as, for example a 
series of ultraviolet sensors, or a mass spectrometer. 
The manner in which data analysis may be conducted is 
discussed below. 

Fig. 3 is a block diagram of an analysis system 60,
wherein the sample preparation portion 12 includes a
sample introduction unit 20 and a first dimension sample
separation apparatus 62 and a second dimension sample
separation apparatus 64. By way of example, first
dimension sample separation apparatus 62 and second
dimension sample separation apparatus 64 may be two
successive and different liquid chromatography units, or
may be consolidated as a two-dimensional electrophoresis
apparatus. Separated sample components are analyzed by a
single channel detection apparatus 66, such as, for
example, an ultraviolet sensor with a 245 nm bandpass
filter, or a gray scale gel imager. Again, the manner in
which data analysis may be conducted is discussed below.

Fig. 4A illustrates a three-dimensional data array 70
compiled from a series of two-dimensional arrays 72A to
72N, representative of successive samples of a mixture of
components to be analyzed. Two dimensional data arrays
72A to 72N may be produced by, for example, two
dimensional gel electrophoresis, or successive
chromatographic separations, as described above with
respect to Fig. 3, or the combination of other separation
techniques.

Fig. 4B illustrates a three-dimensional data array 74
compiled from a series of two-dimensional arrays 76A to
76N, representative of successive samples of a mixture of
components to be analyzed. Two dimensional data arrays
76A to 76N may be produced by, for example, one
dimensional gel electrophoresis, or liquid
chromatography, followed by multi-channel analysis, as
described above with respect to Fig. 2, or by other
techniques such as gas chromatography/infrared
spectroscopy (GC/IR) or LC/fluorescence.

Fig. 4C illustrates a three-dimensional data array 78
compiled from a series of two-dimensional arrays 80A to
80N, representative of successive samples of a mixture of
components to be analyzed. Two dimensional data arrays
80A to 80N are produced by, for example, protein affinity
chips which are able to selectively bind proteins to
defined regions (spots) on their surfaces, of the type
sold by Ciphergen Biosystems, Inc. of Fremont,
California, USA, followed by multi-channel analysis, such
as Surface Enhanced Laser Desorption/Ionization (SELDI)
time of flight mass spectrometry, which may be one of the
systems described above with respect to Fig. 2. Another
technique which may be used is a 1D protein array
combined with multi-channel fluorescence detection.

Fig. 5 illustrates a three-dimensional data array 82
compiled from a series of two-dimensional arrays 84A to
84N, representative of a single sample of a mixture of
components to be analyzed. Two-dimensional data arrays
84A to 84N may be produced by, for example, two-
dimensional gel electrophoresis, or successive liquid
chromatography, as described above with respect to Fig.
1. Multi-channel detection by, for example, mass
spectrometry, as described above with respect to Fig.
1, produces data in the third dimension. Other
suitable techniques are 2D LC with multi-channel UV or
fluorescence detection, 2D LC with IR detection, and 2D
protein array with mass spectrometry.

Fig. 6 illustrates a data array 84 obtained by two-
dimensional liquid phase separation (for example strong
cation exchange chromatography followed by reversed phase
chromatography). The third dimension is represented by
the data along a mass axis 86 from mass spectral
detection.



The data arrays of Figs. 4A, 4B, 4C, 5 and 6 contain
terms representative of all components in all of the
samples or of a single sample, as the case may be
(including the components of any calibration standards).

Fig. 7 illustrates correction for time skew of a
scanning multi-channel detector connected to a time-based
separation, as is the case in LC/MS where the LC is
connected to a mass spectrometer which sweeps through a
certain mass range during a predetermined scanning time.
This type of time skew exists for most mass
spectrometers with the exception of simultaneous systems
such as a magnetic sector system which detects ions of
all masses simultaneously. Other examples include GC/IR
where volatile compounds are separated in terms of
retention time after passing through a column while the IR
spectrum is being acquired through either a scanning
monochromator or an interferometer. When a time-dependent
event such as a separation or reaction is connected to a
detection system that sequentially scans through multiple
channels, a time skew is generated where channels scanned
earlier correspond to an earlier point in time for the
event whereas the channels scanned later correspond
to a later point in time for the event. This time skew
can be corrected by way of interpolation on a channel-by-
channel basis to generate multi-channel data that
correspond to the same point in time for all channels,
i.e., to interpolate for each channel from the solid
tilted lines onto the corresponding dashed horizontal
lines in Fig. 7.
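
By way of non-limiting illustration, the channel-by-channel interpolation described above may be sketched in Python with NumPy. The sketch assumes, since these details are not specified above, that the channels within a scan are acquired at evenly spaced instants, that each scan begins as soon as the previous one ends, and that linear interpolation is adequate. The function name correct_time_skew and all numerical values are hypothetical.

```python
import numpy as np

def correct_time_skew(data, scan_time):
    """data[s, c] is the intensity of channel c in scan s, with the channels
    of one scan acquired one after another. Returns an array of the same
    shape in which every channel is interpolated to the scan start times."""
    n_scans, n_channels = data.shape
    scan_starts = np.arange(n_scans) * scan_time
    # Assumed acquisition delay of each channel within a scan
    offsets = np.arange(n_channels) * scan_time / n_channels
    corrected = np.empty_like(data, dtype=float)
    for c in range(n_channels):
        acquired_at = scan_starts + offsets[c]
        # np.interp clamps outside the sampled range, so the first scan of a
        # delayed channel simply keeps its first measured value.
        corrected[:, c] = np.interp(scan_starts, acquired_at, data[:, c])
    return corrected

# Synthetic check: a signal that rises linearly with time in every channel
scan_time, n_scans, n_channels = 1.0, 5, 4
scan_starts = np.arange(n_scans) * scan_time
offsets = np.arange(n_channels) * scan_time / n_channels
data = 2.0 * (scan_starts[:, None] + offsets[None, :]) + 1.0
corrected = correct_time_skew(data, scan_time)
expected = 2.0 * scan_starts[:, None] + 1.0   # value at each scan start time
```

For the linear test signal the interpolated values from the second scan onward coincide with the signal at the scan start times; the first scan cannot be corrected for delayed channels because no earlier sample exists to interpolate from.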

Fig. 8 is a general flow chart of how sample data is
acquired and processed in accordance with the invention.
Collection and processing of samples, such as biological
samples, is performed at 100. If a single sample is being
processed, three-dimensional data is acquired at 102. If
two-dimensional data is to be acquired with multiple
samples at 106, an internal standard is optionally added
to the sample at 104. As described with respect to any of
the techniques and systems above, a three-dimensional
data array is formed at 108. The three-dimensional data
array undergoes direct decomposition at 110. Different
paths are selected at 112 based on whether or not a two-
dimensional measurement has been made. If two-dimensional
measurements have been made, pure analyte profiles in
each dimension are obtained at 114 along with their
relative concentrations across all samples. If three-
dimensional measurements have been made on a single
sample, pure analyte profiles for all analytes in the
sample along all three dimensions are obtained at 116. In
either case, data interpretation, including analyte
grouping, cluster analysis and other types of expression
and analysis, is conducted at 118 and the results are
reported out on display 40 of computer system 18,
associated with a system of one of Figs. 1, 2 or 3.
The modes of analysis of the data are described below,
with respect to specific examples, which are provided in
order to facilitate understanding of, but not by way of
limitation to, the scope of the invention.

If the response matrix, $R_j$ (m x n), for a typical sample
can be expressed in the following bilinear form:

$$R_j = \sum_{i=1}^{k} c_i \, x_i \, y_i^{T}$$

where $c_i$ is the concentration of the ith analyte, $x_i$ (m x
1) is the response of this analyte along the row axis
(e.g., LC elution profile or chromatogram of this analyte
in LC/MS), $y_i$ (n x 1) is the response of this analyte
along the column axis (e.g., MS spectrum of this analyte
in LC/MS), and k is the number of analytes in the sample,
then, when the response matrices of multiple samples
(j = 1, 2, ..., p) are compiled, a 3D data array R (m x n x p)
can be formed.
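
As an illustration only, the bilinear response model and the stacking of per-sample response matrices into a 3D data array may be sketched as follows. The array sizes, the random profiles, and the helper name response_matrix are hypothetical and serve only to make the dimensions concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, p = 6, 5, 2, 3            # hypothetical sizes: rows, columns, analytes, samples

X = rng.random((m, k))             # column i: row-axis profile x_i (e.g. LC elution)
Y = rng.random((n, k))             # column i: column-axis profile y_i (e.g. MS spectrum)
C = rng.random((p, k))             # C[j, i]: concentration of analyte i in sample j

def response_matrix(c, X, Y):
    """Bilinear response sum_i c_i x_i y_i^T, computed as X diag(c) Y^T."""
    return (X * c) @ Y.T

# Stack the p response matrices along a third axis: R has shape (m, n, p)
R = np.stack([response_matrix(C[j], X, Y) for j in range(p)], axis=2)
```

Each slice R[:, :, j] has rank at most k, which is what makes the later decomposition into pure analyte profiles possible.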

Thus, at the end of a 2D gel run, a gray-scale image can
be generated and represented in a 2D matrix $\tilde{R}_j$
(dimensioned m by n, corresponding to m different pI
values digitized into rows and n different MW values
digitized into columns, for sample j). This raw image
data needs to be calibrated in both pI and MW axes to
yield a standardized image $R_j$:

$$R_j = A_j \, \tilde{R}_j \, B_j$$

where $A_j$ is a square matrix dimensioned as m by m with
nonzero elements along and around the main diagonal (a
banded diagonal matrix) and $B_j$ is another square matrix (n
by n) with nonzero elements along and around the main
diagonal (another banded diagonal matrix). The matrices
$A_j$ and $B_j$ can be as simple as diagonal matrices
(representing simple linear scaling) or as complex as
banded matrices with increasing or decreasing bandwidths
along the main diagonals (correcting for at least one of
band shift, broadening, and distortion or other types of
non-linearity). A graphical representation of the above
equation in its general form is illustrated in Fig. 9.




When 2-D gel data from multiple samples are collected, a
set of $R_j$ can be arranged to form a 3D data array R as

$$\mathbf{R} = [\, R_1 \mid R_2 \mid \cdots \mid R_p \,]$$

where p is the number of biological samples and with R
dimensioned as m by n by p. This data array (in the
shape of a cube or rectangular solid) can be decomposed
with a trilinear decomposition method based on GRAM
(Generalized Rank Annihilation Method, direct
decomposition through matrix operations without
iteration, Sanchez, E. et al, J. Chemometrics 4, 29
(1990)) or PARAFAC (PARAllel FACtor analysis, iterative
decomposition with alternating least squares, Carroll, J.
et al, Psychometrika 3, 45 (1980); Bezemer, E. et al,
Anal. Chem. 73, 4403 (2001)) into four different arrays
and a residual data array E:

$$R_{abj} = \sum_{i=1}^{k} I_{iii} \, Q_{ai} \, W_{bi} \, C_{ji} + E_{abj}
\qquad (a = 1, \ldots, m;\; b = 1, \ldots, n;\; j = 1, \ldots, p)$$

where C represents the relative concentrations of all
identifiable proteins (k of them with k < min(m, n)) in all
p samples, Q represents the pI profiles digitized at m pI
values for each protein (k of them), W represents the
molecular weight profiles digitized at n values for each
protein (ideally a single peak will be observed that
corresponds to each protein), and I is a new data cube
with scalars on its super-diagonal as the only nonzero
elements.
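
By way of example only, an iterative trilinear decomposition of the kind attributed above to PARAFAC (alternating least squares) may be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the super-diagonal scalars of I are absorbed into the column scaling of the factor matrices, the number of components k is taken as known, and the synthetic two-protein data, the helper names khatri_rao, parafac_als and peak, and the iteration count are all hypothetical.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product; row u * V.shape[0] + v holds U[u] * V[v]."""
    return np.einsum('ui,vi->uvi', U, V).reshape(-1, U.shape[1])

def parafac_als(R, k, n_iter=200, seed=0):
    """Fit R[a, b, j] ~ sum_i Q[a, i] * W[b, i] * C[j, i] by alternating
    least squares, solving for one factor matrix at a time."""
    m, n, p = R.shape
    rng = np.random.default_rng(seed)
    W, C = rng.random((n, k)), rng.random((p, k))
    R1 = R.reshape(m, n * p)                      # columns indexed by (b, j)
    R2 = np.moveaxis(R, 1, 0).reshape(n, m * p)   # columns indexed by (a, j)
    R3 = np.moveaxis(R, 2, 0).reshape(p, m * n)   # columns indexed by (a, b)
    for _ in range(n_iter):
        Q = np.linalg.lstsq(khatri_rao(W, C), R1.T, rcond=None)[0].T
        W = np.linalg.lstsq(khatri_rao(Q, C), R2.T, rcond=None)[0].T
        C = np.linalg.lstsq(khatri_rao(Q, W), R3.T, rcond=None)[0].T
    return Q, W, C

def peak(axis, center):
    """Unit-height Gaussian peak used as a stand-in for a pure profile."""
    return np.exp(-0.5 * (axis - center) ** 2)

# Synthetic two-protein, four-sample data set (all numbers hypothetical)
Q_true = np.column_stack([peak(np.arange(8), 2), peak(np.arange(8), 5)])   # pI profiles
W_true = np.column_stack([peak(np.arange(7), 1), peak(np.arange(7), 4)])   # MW profiles
C_true = np.array([[1.0, 0.5], [0.3, 2.0], [2.0, 1.0], [0.7, 0.2]])        # concentrations
R = np.einsum('ai,bi,ji->abj', Q_true, W_true, C_true)

Q, W, C = parafac_als(R, k=2)
R_fit = np.einsum('ai,bi,ji->abj', Q, W, C)
rel_err = np.linalg.norm(R - R_fit) / np.linalg.norm(R)
```

The recovered columns of Q, W and C are determined only up to column order and scaling, which is why the sketch compares the reconstructed array with the data rather than the factors with their true values.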

When all proteins are distinct (with differing pI values
and differing MW) with expression levels varying in a
linearly independent fashion from sample to sample, the
following direct interpretations of the results can be
expected:

1. The k value from the above decomposition will
automatically be equal to the number of proteins.






2. Values in each row of matrix C, after scaling with the
super-diagonal elements in I, represent the relative
concentrations of these proteins in a particular sample.

3. Each column in matrix Q represents the deconvolved pI
profile of a particular protein.

4. Each column in matrix W represents the deconvolved MW
profile of a particular protein.

If these proteins are distinct but with correlated
expression levels from sample to sample (matrix C with
linearly dependent columns), the interpretation can only
be performed on the group of proteins having correlated
expression levels, not on each individual protein, a
finding of significance for proteomics research.

Based on the decomposition presented above, the power of
such a multidimensional system and analysis can be
immediately seen:

a. As a result of this decomposition, which separates the
composite responses into linear combinations of
individual protein responses in each dimension,
quantitative information can be obtained for each protein
in the presence of all other proteins.

b. The decomposition also separates out the profiles
for each individual protein in each dimension, providing
qualitative information for the identification of these
proteins in both dimensions (pI and MW in 2DE and the
chromatographic and the mass spectral dimension in
LC/MS).

c. Each sample in the 3D data array R can contain a
different set of proteins, implying that the proteins of
interest can be identified and quantified in the presence
of unknown proteins, with only the common proteins shared
by all samples in the data array having all nonzero
concentrations in the decomposed matrix C.

d. A minimum of only two distinct samples will be
required for this analysis, providing for a much better
way to perform differential proteomic analysis without
labels such as in ICAT to quickly and reliably pick out
the proteins of interest in the presence of other
uninteresting proteins.

e. The number of analytes that can be analyzed is
limited by the maximum allowable pseudo-rank for each
response matrix $R_j$, which can easily reach thousands (ion
trap MS) to hundreds of thousands (TOF or FTICR-MS),
paving the way for large scale proteomic analysis on
complex biological samples.



f. A typical LC/MS run can be completed in less than 2
hours with no other chemical processes or sample
preparation steps involved, pointing to at least a 10-fold
gain in throughput and tremendous simplification in
informatics.

g. Since full LC/MS data are used in the analysis,
nearly 100% sequence coverage can be achieved without the
MS/MS experiments.

An important advantage of the above analysis, based on an
image of the 2-D gel separation, is that it is non-
destructive and one can follow up with further
confirmation through the use of, for example, MALDI TOF.
The above analysis can also be applied to protein digests
where all peptides from the same protein can be treated
as a distinct group for analysis and interpretation. The
separation of pI and MW profiles into individual proteins
can still be performed when separation into individual
peptides is not feasible.

Left and right transformation matrices $A_j$ and $B_j$ can
preferably be determined using internal standards added to
each sample. These internal standards are selected to
cover all pI and MW ranges, for example, five internal
standards with one on each corner of the 2D gel image and
one right in the center. The concentrations of these
internal standards would vary from one sample to another
so that the corresponding matrix C in the above
decomposition can be partitioned as

$$C = [\, C_s \mid C_{unk} \,]$$

where all columns in $C_s$ are independent, i.e., $C_s$ is full
rank, or better yet, the ratio between the largest and
the smallest singular value is minimized. Now, with part
of the matrix C known in the above decomposition, it is
possible to perform the decomposition such that the
transformation matrices $A_j$ and $B_j$ for each sample
(j = 1, 2, ..., p) can be determined in the same decomposition
process to minimize the overall residual E. The scale of
the problem can be drastically reduced by parameterizing
the nonzero diagonal bands in $A_j$ and $B_j$, for example, by
specifying a band-broadening filter of Gaussian shape for
each row in $A_j$ and each column in $B_j$ and allowing for
smooth variation of the Gaussian parameters down the rows
in $A_j$ and across the columns in $B_j$. With matrices $A_j$ and
$B_j$ properly parameterized and analytical forms of
derivatives with respect to the parameters derived, an
efficient Gauss-Newton iteration approach can be applied
to the trilinear decomposition or PARAFAC algorithm to
arrive at both the desired decomposition and the proper
transformation matrices $A_j$ and $B_j$ for each sample.
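
For illustration only, one possible parameterization of the banded transformation matrices by Gaussian band-broadening filters may be sketched as follows. The helper name gaussian_band_matrix, the half bandwidth, the linear variation of shift and width along the rows, and the random stand-in image are assumptions made for the sketch and are not taken from the description above; the Gauss-Newton fitting of the parameters is not shown.

```python
import numpy as np

def gaussian_band_matrix(size, shift, width, half_band=4):
    """Square banded matrix whose row r is a unit-sum Gaussian centered at
    r + shift[r] with standard deviation width[r]; entries farther than
    half_band from the main diagonal stay zero."""
    M = np.zeros((size, size))
    for r in range(size):
        lo, hi = max(0, r - half_band), min(size, r + half_band + 1)
        cols = np.arange(lo, hi)
        g = np.exp(-0.5 * ((cols - (r + shift[r])) / width[r]) ** 2)
        M[r, lo:hi] = g / g.sum()
    return M

m, n = 50, 40                                   # hypothetical image size
rows = np.arange(m)
# Left matrix: shift and broadening vary smoothly (here linearly) down the rows
A = gaussian_band_matrix(m, shift=0.02 * rows, width=0.6 + 0.01 * rows)
# Right matrix: transpose so that each column, not each row, holds one filter
B = gaussian_band_matrix(n, shift=np.zeros(n), width=np.full(n, 0.8)).T

raw = np.random.default_rng(1).random((m, n))   # stand-in for a raw gel image
standardized = A @ raw @ B                      # left and right transformation
```

With this parameterization each matrix is described by a handful of smooth parameters instead of every nonzero band element, which is the reduction in problem size referred to above.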

Compared with ICAT (isotope-coded affinity tags, Gygi, S.
P. et al, Nature Biotech. 1999, 17, 994), this approach
is not limited to analyzing only two samples and does not
require peptide sequencing for protein identifications.
The number of samples that can be quantified can be in
the hundreds to thousands or even tens of thousands and
the protein identification can be accomplished through
the mass spectral data alone once all these proteins have
been mathematically resolved and separated. Furthermore,
there is no additional chemistry involving isotope
labels, which should reduce the risk of losing many
important proteins during the tedious sample preparation
stages required for ICAT.

In brief, the present invention, using the method of
analysis described above, provides a technique for
protein identification and protein expression analysis
using 2D data having the following features:

- 2D gel data from multiple samples is used to form a
3D data array;

- for each of the following scenarios there will be a
different set of interpretations applicable:

a) where all proteins are distinct with expression
levels varying independently from sample to sample,

b) where all proteins are distinct with correlated
expression levels from sample to sample;

- centroiding on mass spectral continuum data is avoided;

- raw mass spectral data alone can be directly utilized
and is sufficient as input into the data array
decomposition;

- full mass spectral calibration, as for example that
performed in United States patent application serial
number 10/689,313, may be optionally performed on the raw
continuum data to obtain fully calibrated continuum data
as inputs to the analysis, allowing for even more
accurate mass determination and library search for the
purpose of protein identification once the deconvolved
mass spectrum becomes available for an individual protein
after the array decomposition;

- this approach is based on mathematics instead of
physical sequencing to resolve and separate proteins and
does not require peptide sequencing for protein
identifications;

- the results are both qualitative and quantitative;

- gel spot alignment and matching is automatically
built into the data analysis.

Furthermore, it is preferred to have fully calibrated
continuum mass spectral data in this invention to further
improve mass alignment and spectral peak shape
consistency, as described in co-pending application
10/689,313, a brief summary of which is set forth below.

Producing Calibrated Continuum Mass Spectral Data

A calibration relationship of the form:

$$m = f(m_0) \qquad \text{(Equation A)}$$

can be established through a least-squares polynomial fit
between the centroids measured and the centroids
calculated using all clearly identifiable isotope
clusters available in the mass spectral standard across
the mass range.

In addition to this simple mass calibration, additional
full spectral calibration filters are calculated to serve
two purposes simultaneously: the calibration of mass
spectral peak shapes and mass spectral peak locations.
Since the mass axis may have been pre-calibrated, the
mass calibration part of the filter function is reduced
in this case to achieve a further refinement on mass
calibration, i.e., to account for any residual mass
errors after the polynomial fit given by Equation A.

This total calibration process applies easily to
quadrupole-type MS including ion traps where mass
spectral peak width (Full Width at Half Maximum or FWHM)
is generally roughly consistent within the operating mass
range. For other types of mass spectrometer systems such
as magnetic sectors, TOF, or FTMS, the mass spectral peak
shape is expected to vary with mass in a relationship
dictated by the operating principle and/or the particular
instrument design. While the same mass-dependent
calibration procedure is still applicable, one may prefer
to perform the total calibration in a transformed data
space consistent with a given relationship between the
peak width/location and mass.

In the case of TOF, it is known that mass spectral peak
width (FWHM) $\Delta m$ is related to the mass m by the
following relationship:

$$\Delta m = a \sqrt{m}$$

where a is a known calibration coefficient. In other
words, the peak width measured across the mass range
would increase with the square root of the mass. With a
square root transformation to convert the mass axis into
a new function as follows:

$$m' = \sqrt{m}$$

where the peak width (FWHM) as measured in the
transformed mass axis is given by

$$\frac{\Delta m}{2\sqrt{m}} = \frac{a}{2}$$

which will remain unchanged throughout the spectral
range.

For an FTMS instrument, on the other hand, the peak
width (FWHM) $\Delta m$ will be directly proportional to the mass
m, and therefore a logarithm transformation will be
needed:

$$m' = \ln(m)$$

where the peak width (FWHM) as measured in the
transformed log-space is given by

$$\ln\frac{m + \Delta m}{m} = \ln\left(1 + \frac{\Delta m}{m}\right) \approx \frac{\Delta m}{m}$$

which will be fixed independent of the mass. Typically
in FTMS, $\Delta m / m$ can be managed on the order of $10^{-5}$, i.e.,
$10^{5}$ in terms of the resolving power $m / \Delta m$.
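
As a numerical illustration only, the effect of the two axis transformations on peak width may be checked as follows. The coefficients a and b and the list of masses are hypothetical, and the peak is modeled simply as the interval of one FWHM centered on each mass.

```python
import numpy as np

masses = np.array([100.0, 400.0, 1600.0, 6400.0])

# TOF-like model: FWHM grows as a * sqrt(m); a is a hypothetical coefficient
a = 0.01
tof_fwhm = a * np.sqrt(masses)
# Width of the same peak after mapping both half-height points through sqrt()
tof_width_sqrt_axis = np.sqrt(masses + tof_fwhm / 2) - np.sqrt(masses - tof_fwhm / 2)

# FTMS-like model: FWHM grows as b * m; b is a hypothetical relative width
b = 1e-5
ft_fwhm = b * masses
# Width of the same peak after mapping both half-height points through log()
ft_width_log_axis = np.log(masses + ft_fwhm / 2) - np.log(masses - ft_fwhm / 2)
```

In both cases the transformed width comes out the same at every mass (close to a/2 on the square root axis and close to b on the logarithmic axis), which is the property that allows a single calibration filter to serve the full mass range.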

For a magnetic sector instrument, depending on the
specific design, the spectral peak width and the mass
sampling interval usually follow a known mathematical
relationship with mass, which may lend itself to a
particular form of transformation through which the
expected mass spectral peak width would become
independent of mass, much like the way the square root
and logarithm transformations do for TOF and FTMS.



When the expected mass spectral peak width becomes
independent of the mass, due either to a
transformation such as the logarithmic transformation on
FTMS and the square root transformation on TOF-MS or to
the intrinsic nature of a particular instrument such as a
well designed and properly tuned quadrupole or ion trap
MS, huge savings in computational time will be achieved
with a single calibration filter applicable to the full
mass spectral range. This would also simplify the
requirement on the mass spectral calibration standard: a
single mass spectral peak would be required for the
calibration, with additional peak(s) (if present) serving
as a check or confirmation only, paving the way for
complete mass spectral calibration of each and every MS
based on an internal standard added to each sample to be
measured.

There are usually two steps in achieving total mass
spectral calibration. The first step is to derive actual
mass spectral peak shape functions and the second step is
to convert the derived actual peak shape functions into
specified target peak shape functions centered at correct
mass locations. An internal or external standard with its
measured raw mass spectral continuum $y_0$ is related to the
isotope distribution y of a standard ion or ion fragment
by

$$y_0 = y \otimes p$$

where p is the actual peak shape function to be
calculated. This actual peak shape function is then
converted to a specified target peak shape function t (a
Gaussian of certain FWHM, for example) through one or
more calibration filters given by

$$t = p \otimes f$$
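
By way of illustration only, and reading the operator in the relation between t, p and f as discrete convolution, a calibration filter may be estimated from a known peak shape and a target shape by linear least squares. The function names convolution_matrix and calibration_filter, the five-point peak shape, and the filter length are hypothetical; the target in the example is deliberately built from a known filter so that an exact solution exists and the solver can be checked.

```python
import numpy as np

def convolution_matrix(p, filter_len):
    """Matrix P with P @ f == np.convolve(p, f) for any f of length filter_len."""
    P = np.zeros((len(p) + filter_len - 1, filter_len))
    for j in range(filter_len):
        P[j:j + len(p), j] = p
    return P

def calibration_filter(p, t, filter_len):
    """Least-squares f such that np.convolve(p, f) approximates the target t.
    t must have length len(p) + filter_len - 1."""
    P = convolution_matrix(p, filter_len)
    f, *_ = np.linalg.lstsq(P, t, rcond=None)
    return f

# Hypothetical asymmetric peak shape, as might be measured from a standard
p = np.array([0.05, 0.25, 0.5, 0.15, 0.05])
# Known filter used only to construct a target for which an exact f exists
f_true = np.exp(-0.5 * (np.arange(-3, 4) / 1.2) ** 2)
t = np.convolve(p, f_true)

f = calibration_filter(p, t, filter_len=len(f_true))
residual = np.linalg.norm(np.convolve(p, f) - t)
```

With a real target shape such as a Gaussian of chosen FWHM, an exact filter of finite length generally does not exist and the same call returns the best approximation in the least-squares sense.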

The calibration filters calculated above can be arranged 
into the following banded diagonal filter matrix: 



in which each short column vector on the diagonal, $f_i$, is
taken from the convolution filter calculated above for
the corresponding center mass. The elements in $f_i$ are
taken from the elements of the convolution filter in
reverse order, i.e.,

$$f_i = \begin{bmatrix} f_{i,m} \\ f_{i,m-1} \\ \vdots \\ f_{i,1} \end{bmatrix}$$

As an example, this calibration matrix will have a
dimension of 8,000 by 8,000 for a quadrupole MS with mass
coverage up to 1,000 amu at 1/8 amu data spacing. Due to
its sparse nature, however, the typical storage requirement
would only be around 40 by 8,000 with an effective filter
length of 40 elements covering a 5-amu mass range.
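
As a sketch only, the compact storage of the banded filter matrix and its application to a raw spectrum may be illustrated as follows. The layout assumed here (one stored column of filter elements per mass point, centered on the diagonal, applied as a row-vector times matrix product) and the names apply_banded_filter and dense_filter_matrix are assumptions made for the illustration, and the small example uses the same filter at every mass point purely so the result can be compared with an ordinary convolution.

```python
import numpy as np

def apply_banded_filter(raw, bands):
    """raw: spectrum of length n. bands: (L, n) array whose column i holds the
    L nonzero elements of column i of the full n x n filter matrix, centered
    on the diagonal. Returns raw @ F without ever forming F."""
    L, n = bands.shape
    half = L // 2
    padded = np.pad(raw, (half, L - 1 - half))
    out = np.zeros(n)
    for l in range(L):
        out += padded[l:l + n] * bands[l]
    return out

def dense_filter_matrix(bands):
    """Expand the compact storage into the full matrix (for checking only)."""
    L, n = bands.shape
    half = L // 2
    F = np.zeros((n, n))
    for i in range(n):
        for l in range(L):
            r = i + l - half
            if 0 <= r < n:
                F[r, i] = bands[l, i]
    return F

n = 20
raw = np.random.default_rng(2).random(n)          # stand-in raw spectrum
f = np.array([0.05, 0.15, 0.5, 0.2, 0.1])         # hypothetical convolution filter
bands = np.tile(f[::-1][:, None], (1, n))         # filter stored in reverse order
calibrated = apply_banded_filter(raw, bands)

# Storage comparison for the 8,000-point example in the text
full_entries = 8000 * 8000
banded_entries = 40 * 8000
```

Storing the filter elements in reverse order in each column is what makes the matrix product equal to a convolution of the spectrum with the filter, and the banded storage for the 8,000-point example is 200 times smaller than the full matrix.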

Returning to the present invention, further multivariate
statistical analysis can be applied to matrix C to study
and understand the relationships between different
samples and different proteins. The samples and proteins
can be grouped or cluster-analyzed to see which proteins
are expressed more within which sample groups. For
example, a dendrogram can be created using the scores or
loadings from the principal component analysis of the C
matrix. Typical conclusions include that cell samples from
healthy individuals cluster around each other while
those from diseased individuals cluster in a
different group. For samples collected over a period of
time after a certain treatment, the samples may show a
continuous change in the expression levels of some
proteins, indicating a biological reaction to the
treatment on the protein level. For samples collected
over a series of dosages, the changes in relevant
proteins can indicate the effects of dosages on this set
of proteins and their potential regulations.
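
For illustration only, the principal component scores of a concentration matrix C may be computed as sketched below. The matrix values are invented (three samples standing in for healthy individuals followed by three standing in for diseased individuals), the function name pca_scores is hypothetical, and the construction of a dendrogram from the scores is left to a clustering library of choice and is not shown.

```python
import numpy as np

def pca_scores(C, n_components=2):
    """Scores of the mean-centered rows of C on its leading principal components."""
    centered = C - C.mean(axis=0)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

# Hypothetical relative concentrations: 6 samples (rows) x 4 proteins (columns)
C = np.array([
    [1.0, 2.0, 0.5, 1.0],   # samples 0-2: stand-ins for healthy individuals
    [1.1, 2.1, 0.5, 0.9],
    [0.9, 1.9, 0.6, 1.0],
    [3.0, 2.0, 0.5, 0.2],   # samples 3-5: stand-ins for diseased individuals
    [3.2, 2.1, 0.4, 0.1],
    [2.9, 1.9, 0.5, 0.3],
])

scores = pca_scores(C)
pc1 = scores[:, 0]          # first principal component score of each sample
```

In this invented example the first principal component is dominated by the first and last proteins, so the two groups of samples fall on opposite sides of zero along pc1; the overall sign of a principal component is arbitrary.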

In the case where proteins are pre-digested into peptides
before the analysis, each column in matrix C would
represent a linear combination of a group of peptides
coming from the same protein or a group of proteins
showing similar expression patterns from sample to
sample. A dendrogram performed to classify columns in
matrix C, such as the one shown in Fig. 11, would group
individual peptides back into their respective proteins
and thus accomplish the analysis on the proteome level.

Qualitative (or signatory) information for the proteins
identified can be found in pI profile matrix Q and MW
profile matrix W. The qualitative information can serve
the purpose of protein identification and even library
searching, especially if the molecular weight information
is determined with sufficient accuracy. In summary, the
three matrices C, Q, and W, when combined, allow for both
protein quantification and identification with automatic
gel matching and spot alignment from the determination of
transformation matrices represented by $A_j$ and $B_j$.

The above 2-D data can come in different forms and
shapes. An alternative to MALDI-TOF after
excising/digesting 2-D gel spots is to run these samples
through conventional LC/MS, for example on the Thermo
Finnigan LCQ system, to further separate proteins from
each gel spot before MS analysis. A very important
application of this approach allows for rapid and direct
protein identification and quantitation by avoiding 2-D
gel (2DE) separation altogether, thus increasing the
throughput by orders of magnitude. This can be
accomplished through the following steps:

1. Directly digest the sample containing hundreds to tens
of thousands of proteins without any separation.

2. Run the digested sample on a conventional LC/MS
instrument to obtain a two-dimensional array. It should
be noted that MS/MS capability is not a requirement in
this case, although one may choose to run the sample on an
LC/MS/MS system, which generates additional sequencing
information.

3. Repeat 1 and 2 for multiple samples to generate a
three-dimensional data array.

4. Decompose the data array using the approach outlined
above.

5. Replace the pI axis with LC retention time and the MW
axis with the mass axis in interpretation and mass
spectral searching for the purpose of protein
identification. The mathematically separated mass spectra
can be further processed through centroiding and de-
isotoping to yield stick spectra consistent with
conventional databases and search engines such as Mascot
or SwissProt, available online from
http://www.matrixscience.com or from
http://us.expasy.org/sprot/. It is preferable, however,
to fully calibrate the raw mass spectral continuum data
into calibrated continuum data prior to the data array
decomposition to yield fully calibrated continuum mass
spectral data for each deconvolved protein or peptide.
These continuum mass spectral data would then be used,
along with their high mass accuracy and without
centroiding, for protein identification through a novel
database search described in a co-pending patent
application.

Depending on the nature of the LC column, the LC can act
as another form of charge separation, similar to the pI
axis in 2-D gel. The mass spectrometer in this case
serves as a precise means for molecular weight
measurement, similar to the MW axis in 2-D gel analysis.
Due to the high mass accuracy available on a mass
spectrometer, the transformation matrix $B_j$ can be reduced
to a diagonal matrix to correct for mass-dependent
ionization efficiency changes or even an identity matrix
to be dropped out of the equation, especially after the
full mass spectral calibration mentioned above. In order
to handle large protein molecules, the protein sample is
typically pre-digested into peptides through the use of
enzymatic or chemical reactions, for example, with trypsin.
Therefore, it is typical to see multiple LC peaks as well
as multiple masses for each protein of interest. While
this may add complexities for sample handling, it greatly
enhances the selectivity of library search and protein
identification. Multiple digestions may be used to
further enhance the selectivity. Taking this to the
extreme, each protein may be digested into peptides of
varying lengths beforehand (Edman degradation) to yield
complete protein sequence information from matrix W.
This is a new technique for protein sequencing based on
mathematics rather than physical sequencing as an
alternative to LC tandem mass spectrometry. In
applications including MS, the approach does not require
any data preprocessing on the continuum data from mass
scans, such as the centroiding and de-isotoping that are
typically done in commercial instrumentation and that are
prone to many unsystematic errors. The raw counts data
can be supplied and directly utilized as inputs into the
data array decomposition.

Other 2-D data that can yield similar results with
identical approaches include but are not limited to the
following examples that have 2-D separation with single
point detection, or 1-D separation with multi-channel
detection, or 2-D multi-channel detection:

1. Each 1-D or 2-D gel spot can be treated as an 
independent sample for the subsequent LC/MS analysis to 
generate one LC/MS 2-D data array for each spot and a data 
array containing all gel spots and their LC/MS data 
arrays. Due to the added resolving power gained from both 
gel and LC separation, more proteins can be more 
accurately identified. 

2. Other types of 2-D separation, such as
pI/hydrophobicity, MW/hydrophobicity, or a 1-D separation
using either pI, MW, or hydrophobicity and a form of
multi-channel electromagnetic or mass spectral detection,
such as 1-D gel combined with on-the-gel MALDI TOF, or
LC/TOF, LC/UV, LC/Fluorescence, etc. can be used.

3. Other types of 2-D separations such as 2-D liquid
chromatography, with a single-channel detection (UV at
245 nm or fluorescence-tagged to be measured at one
wavelength) can be used.

4. 1D or 2D protein arrays coupled with mass spectral or
other multi-channel detection, where each element on the
array captures a particular combination of proteins in a
way not dissimilar to LC columns, can be used. These 1D or
2D spots can be arranged into one dimension of the 2-D
array with the other dimension being mass spectrometry.
These protein spots are similar to sensor arrays such as
Surface Acoustic Wave Sensors (SAWs, coated with GC column
materials to selectively bind to a certain class of
compounds) or electronic noses such as conductive polymer
arrays on which a binding event would generate a distinct
electrical signal.

5. Multi-wavelength emission and excitation fluorescence
(EEM) on a single sample with different proteins tagged
differentially or specific to a segment of the protein
sequence can be used.

In second order proteomics analysis, the data array is
formed by the 2D response matrices from multiple samples.
Another effective way to create a data array is to
include one more dimension in the measurement itself such
that a data array can be generated from a single sample
on what is called a third order instrument. One such
instrument starting to receive wide attention in
proteomics is LC/LC/MS, amenable to the same
decomposition to yield mathematically separated elution
profiles in both LC dimensions and MS spectral responses
for each protein present in the sample.

Thus, while the two-dimensional approaches outlined above
are major improvements in the art, a three-dimensional
approach has the advantages of being much faster, more
reproducible, and simpler, arising from the fact that
the sample stays in the liquid phase throughout the
entire process. However, since many proteins are too
large for conventional mass spectrometers, and all
proteins in the sample may be digested into peptide
fragments before separation and mass spectral
detection, the number of peptides and the complexity of
the system increase by at least one order of magnitude.
This results in what appears to be an insurmountable
problem for data handling and data interpretation. In
addition, available approaches stop short at only the
level of qualitative protein identification for samples
of very limited complexity such as yeast (Washburn, M. P.
et al, Nat. Biotechnol. 19, 242-247 (2001)). The
approach presented below achieves both identification and
quantification of anywhere from hundreds up to tens
of thousands of proteins in a single two-dimensional
liquid chromatography-mass spectrometry (LC/LC/MS or 2D-
LC/MS) run.

By way of example, either size exclusion and reversed 
phase liquid chromatography (SEC-RPLC) or strong cation 
exchange and reversed phase liquid chromatography 
(SCX-RPLC) can be used for initial separation. This is 
followed by mass spectrometry detection (MS) in the form 
of either electro-spray ionization (ESI) mass 
spectrometry or time-of-flight mass spectrometry. The set 
of data generated is arranged into a three dimensional 
data array, R, that contains mass intensity (count) data 
at different combinations of retention times (t1 and t2, 
corresponding to the retention times in each LC 
dimension, for example, SEC and RPLC retention times, 
digitized at m and n different time points) and masses 
(digitized at p different values covering the mass range 
of interest). A graphical representation of this data 
array is provided in Fig. 6. 
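
As an illustration of this arrangement, the following minimal Python sketch stacks one mass spectrum per (t1, t2) retention-time pair into an m by n by p array R. The function name build_data_array, the dictionary input format, and the synthetic Poisson counts are assumptions made purely for illustration; they are not part of the disclosed system.

```python
import numpy as np

def build_data_array(spectra, m, n):
    """Stack mass spectra into R[i, j, :] for the retention-time pair (t1_i, t2_j).

    `spectra` maps (i, j) index pairs to 1-D intensity vectors of equal length p.
    """
    p = len(next(iter(spectra.values())))
    R = np.zeros((m, n, p))
    for (i, j), intensity in spectra.items():
        R[i, j, :] = intensity
    return R

# Synthetic example: m = 3 and n = 4 retention-time points, p = 5 mass channels.
rng = np.random.default_rng(0)
spectra = {(i, j): rng.poisson(10.0, size=5) for i in range(3) for j in range(4)}
R = build_data_array(spectra, m=3, n=4)
print(R.shape)
```

Each slice R[i, j, :] is then the mass spectrum recorded at one combination of the two retention times.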

It is important to note that while the mass spectral data 
can be preprocessed into stick spectral form through 
centroiding and de-isotoping, this is not required for 
this approach to work. Raw mass spectral continuum data 
can work better, due to the preservation of spectral peak 
shape information throughout the analysis and the 
elimination of all types of centroiding and de-isotoping 
errors mentioned above. A preferable approach is to 
fully calibrate the continuum raw mass spectral data 
into calibrated continuum data to achieve high mass 
accuracy and allow for a more accurate library search. 

At each retention time combination of t1 and t2 in data 
array R (dimensioned as m by n by p), the fraction of the 
sample injected into the mass spectrometer is composed of 
some linear combination of a subset of the peptides in 
the original sample. This fraction of the sample is 
likely to contain somewhere between a few peptides and a 
few tens of thousands of peptides. The mass spectrum 
corresponding to such a sample fraction is likely to be 
very complex and, as noted above, the challenges of 
resolving such a mix into individual proteins for protein 
identification and especially quantification would seem 
to be insurmountable. 






However, the three-dimensional data array, as noted above 
with respect to two-dimensional analysis, can be 
decomposed with a trilinear decomposition method based on 
GRAM (Generalized Rank Annihilation Method, direct 
decomposition through matrix operations without 
iteration) or PARAFAC (PARAllel FACtor analysis, 
iterative decomposition with alternating least squares) 
into four different matrices and a residual data cube E, 
as noted above. 
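
A trilinear decomposition of this kind can be sketched in a few lines of NumPy. The block below is a minimal PARAFAC-style alternating least squares loop, not the disclosed implementation: the SVD-based starting guess, the fixed iteration count, the function names, and the synthetic Gaussian test data are all assumptions made for illustration. GRAM is not shown.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of (I x k) and (J x k) -> (I*J x k)."""
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

def parafac_als(R, k, n_iter=200):
    """Fit R[i,j,s] ~ sum_a conc[a] * C[i,a] * Q[j,a] * W[s,a] by alternating least squares."""
    m, n, p = R.shape
    R1 = R.reshape(m, n * p)                      # t1 by (t2, mass)
    R2 = R.transpose(1, 0, 2).reshape(n, m * p)   # t2 by (t1, mass)
    R3 = R.transpose(2, 0, 1).reshape(p, m * n)   # mass by (t1, t2)
    # Assumed starting guess: leading left singular vectors of two unfoldings.
    C = np.linalg.svd(R1, full_matrices=False)[0][:, :k]
    Q = np.linalg.svd(R2, full_matrices=False)[0][:, :k]
    for _ in range(n_iter):
        # Each update is a linear least squares solve with the other two factors fixed.
        W = R3 @ np.linalg.pinv(khatri_rao(C, Q)).T
        C = R1 @ np.linalg.pinv(khatri_rao(Q, W)).T
        Q = R2 @ np.linalg.pinv(khatri_rao(C, W)).T
    # Move the column scales into the super-diagonal scalars.
    conc = np.ones(k)
    for F in (C, Q, W):
        norms = np.linalg.norm(F, axis=0)
        F /= norms
        conc *= norms
    E = R - np.einsum('a,ia,ja,sa->ijs', conc, C, Q, W)   # residual data cube
    return C, Q, W, conc, E

# Synthetic two-component example with Gaussian elution profiles and spectra.
t = np.arange(12.0)
peak = lambda center: np.exp(-0.5 * (t - center) ** 2)
C_true = np.column_stack([peak(3), peak(8)])
Q_true = np.column_stack([peak(4), peak(9)])
W_true = np.column_stack([peak(2) + peak(6), peak(5) + peak(10)])
R = np.einsum('a,ia,ja,sa->ijs', np.array([3.0, 1.0]), C_true, Q_true, W_true)

C, Q, W, conc, E = parafac_als(R, k=2)
print(np.linalg.norm(E) / np.linalg.norm(R))   # relative size of the residual
```

The resolved columns are unique only up to ordering and sign (a column may be sign-flipped in two of the three factors), and real data would also need a convergence test and a way to choose k.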

In this three-dimensional analysis, C represents the 
chromatograms with respect to t1 of all identifiable 
peptides (k of them), Q represents the chromatograms 
with respect to t2 of all identifiable peptides (k of 
them), W represents the deconvolved continuum mass 
spectra of all peptides (k of them), and I is a new data 
array with scalars on its super-diagonal as the only 
nonzero elements. In other words, through the 
decomposition of this data array, the two retention times 
(t1 and t2) have been identified for each and every 
peptide existing in the sample, along with precise 
determination of the mass spectral continuum for each 
peptide contained in W. 
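
Once resolved profiles are available, the two retention times of each peptide can be read from the maxima of the corresponding columns. The sketch below assumes the profiles are stored column-wise in NumPy arrays; the function name, the time axes, and the Gaussian profiles are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def peak_retention_times(C, Q, t1_axis, t2_axis):
    """Return, for each component, the t1 and t2 at which its resolved profile peaks."""
    # abs() guards against the sign ambiguity of decomposed profiles.
    i = np.argmax(np.abs(C), axis=0)
    j = np.argmax(np.abs(Q), axis=0)
    return t1_axis[i], t2_axis[j]

# Hypothetical time axes (minutes) and two synthetic components.
t1_axis = np.linspace(0.0, 55.0, 12)
t2_axis = np.linspace(0.0, 11.0, 12)
idx = np.arange(12)
gauss = lambda center: np.exp(-0.5 * (idx - center) ** 2)
C = np.column_stack([gauss(3), gauss(8)])
Q = np.column_stack([gauss(4), -gauss(9)])   # one profile deliberately sign-flipped

t1, t2 = peak_retention_times(C, Q, t1_axis, t2_axis)
print(t1, t2)
```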

The foregoing analysis yields information on the peptide 
level, unless intact proteins are directly analyzed 
without digestion and with a mass spectrometer capable of 
handling larger masses. The protein level information, 
however, can be obtained from multiple samples through 
the following additional steps: 

1. Perform the 2D-LC/MS runs as described above for 
multiple samples (l of them) collected over a period of 
time with the same treatment, or at a fixed time with 
different dosages of treatment, or from multiple 
individuals at different disease states. 

2. Perform the data decomposition for each sample as 
described above and fully identify all the peptides 
within each sample. 

3. The relative concentrations of all peptides in each 
sample can be read directly from the super-diagonal 
elements in I. A new matrix S composed of these 
concentrations across all samples can be formed with 
dimensions of l samples by q distinct peptides in all 
samples (q ≥ max(k_i), where k_i is the number of 
peptides in sample i (i = 1, 2, ..., l)). For those 
samples that do not contain some of the peptides existing 
in other samples, the entries in the corresponding rows 
for these peptides (arranged in columns) would be zeros. 

4. A statistical study of the matrix S will allow for 
examination of the peptides that change in proportion to 
each other from one sample to another. These peptides 
could potentially correspond to all the peptides coming 
from the same protein. A dendrogram based on Mahalanobis 
distance calculated from singular value decomposition 
(SVD) or principal component analysis (PCA) of the S 
matrix can indicate the inter-connectedness of these 
peptides. It should be pointed out, however, that there 
would be groups of proteins that vary in tandem from one 
sample to another and thus all their corresponding 
peptides would be grouped into the same cluster. A 
graphical representation of this process is provided in 
Fig. 11. 

5. The matrix S so partitioned according to the grouping 
above represents the results of differential proteomics 
analysis showing the different protein expression levels 
across many samples. 

6. For all peptides in each group identified in step 5 
immediately above, the resolved mass spectral responses 
contained in W are combined to form a composite mass 
spectral signature of all peptides contained in each 
protein or group of proteins that change in tandem in 
their expression levels. Such a composite mass spectrum 
can be either further processed into a stick/centroid 
spectrum (if not so processed already) or preferably 
searched directly against standard protein databases such 
as Mascot and SwissProt for protein identification using 
continuum mass spectral data as disclosed in the 
co-pending application. 
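
Steps 3 and 4 above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the disclosed implementation: the function names, the dictionary input format, the unit-length scaling of each peptide profile, and the synthetic data are all hypothetical. In place of drawing a full dendrogram, the sketch cuts a single-linkage clustering at a fixed Mahalanobis distance, which amounts to finding connected groups of peptides that lie closer than the cut.

```python
import numpy as np

def build_S(samples):
    """Form S (l samples x q distinct peptides); peptides absent from a sample get zero."""
    peptide_ids = sorted({pid for sample in samples for pid in sample})
    column = {pid: c for c, pid in enumerate(peptide_ids)}
    S = np.zeros((len(samples), len(peptide_ids)))
    for row, sample in enumerate(samples):
        for pid, conc in sample.items():
            S[row, column[pid]] = conc
    return S, peptide_ids

def group_peptides(S, n_components, cut):
    """Label peptides whose concentration profiles across samples vary in proportion."""
    # Scale each peptide profile to unit length so proportional profiles coincide.
    P = (S / np.linalg.norm(S, axis=0)).T           # q peptides x l samples
    P = P - P.mean(axis=0)
    U, sv, _ = np.linalg.svd(P, full_matrices=False)
    scores = U[:, :n_components] * sv[:n_components]
    # PCA scores are uncorrelated, so the Mahalanobis distance is the Euclidean
    # distance between scores divided by their per-component standard deviation.
    z = scores / scores.std(axis=0, ddof=1)
    dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=2)
    # Single-linkage cut at height `cut`: connected groups of peptides closer than `cut`.
    labels = np.full(len(dist), -1)
    group = 0
    for start in range(len(dist)):
        if labels[start] >= 0:
            continue
        labels[start] = group
        stack = [start]
        while stack:
            a = stack.pop()
            for b in np.flatnonzero((dist[a] < cut) & (labels < 0)):
                labels[b] = group
                stack.append(b)
        group += 1
    return labels

# Synthetic data: three proteins (A, B, C), two peptides each, four samples.
expression = {"A": [1, 2, 3, 4], "B": [4, 3, 2, 1], "C": [1, 3, 1, 3]}
response = {"A1": 1.0, "A2": 0.5, "B1": 2.0, "B2": 0.3, "C1": 1.5, "C2": 0.8}
samples = [{pid: r * expression[pid[0]][s] for pid, r in response.items()}
           for s in range(4)]

S, peptide_ids = build_S(samples)
labels = group_peptides(S, n_components=2, cut=0.5)
print(dict(zip(peptide_ids, labels)))
```

With noisy measurements the number of retained components and the cut height would have to be chosen from the data, and peptides missing from some samples would no longer be exactly proportional to their siblings.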

Compared to ICAT (Gygi, S. P. et al., Nat. Biotechnol. 17, 
994-999 (1999)), the quantitation proposed here does not 
require any additional sample preparation, has the 
potential of handling many thousands of samples, and uses 
all available peptides (instead of a few available for 
isotope-tagging) in an overall least squares fit to arrive 
at relative protein expression levels. Due also to the 
mathematical isolation of all peptides and the later 
grouping back into proteins, the protein identification 
can be accomplished without the peptide sequencing needed 
in the case of ICAT. In the case of intact protein 
2D-LC/MS analysis, all protein concentrations can be 
directly read off the super-diagonal in I, without any 
further regrouping. It may however still be desirable to 
form the S matrix as above and perform statistical 
analysis on the matrix for the purpose of differential 
proteomics or protein expression analysis. 
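
One possible reading of the "overall least squares fit" is a rank-one fit to the block of S that belongs to a single protein, in which every peptide contributes. The sketch below makes that assumption explicit: the block is modeled as the outer product of an expression profile across samples and a per-peptide response factor, and the leading SVD component gives the least squares solution. The function name, the scaling convention, and the numbers are illustrative only.

```python
import numpy as np

def relative_expression(S_group):
    """Rank-one least squares fit S_group ~ outer(expression, response).

    S_group holds the columns of S (samples x peptides) for one peptide group.
    """
    U, sv, _ = np.linalg.svd(S_group, full_matrices=False)
    expression = U[:, 0] * sv[0]
    # Scale so the sample with the highest level reads 1.0 (this also removes the SVD sign).
    return expression / expression[np.argmax(np.abs(expression))]

# Synthetic group: three samples, three peptides with different response factors.
levels = np.array([1.0, 2.0, 4.0])
response = np.array([1.0, 0.5, 2.0])
S_group = np.outer(levels, response)
print(relative_expression(S_group))
```

Because the fit returns relative levels only, some reference, such as the highest sample used here, has to be chosen to fix the scale.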

In brief, the present invention provides a method for 
protein identification and protein expression analysis 
using three dimensional data having the following 
features: 

- the set of data generated from either of the 
following methodologies is arranged into a 3D data array: 
a) size exclusion and reversed phase liquid 
chromatography (SEC-RPLC), or 
b) strong cation exchange and reversed phase liquid 
chromatography (SCX-RPLC), coupled with either: 
i) electro-spray ionization (ESI) mass spectrometry 
for peptides after protein digestion, or 
ii) time-of-flight (TOF) mass spectrometry for 
peptides or intact proteins; 

- here, the mass spectral data does not have to be 
preprocessed through centroiding and/or de-isotoping, 
though it is preferred to fully calibrate the raw mass 
spectral continuum; 

- mass spectral continuum data can be used directly and 
is in fact preferred, thus preserving spectral peak shape 
information throughout the analysis; 

- this approach is a method of mathematical isolation 
of all peptides and then later grouping back into 
proteins, thus the protein identification can be done 
without peptide sequencing; 

- the present invention provides a quantitative tool 
that does not require any additional sample preparation, 
has the potential of handling many thousands of samples, 
and uses all available peptides in an overall least 
squares fit to arrive at relative protein expression 
levels. 

The above 3-D data can come in different forms and shapes. 
An alternative to 2D-LC/MS is to perform 2D 
electrophoresis separation coupled with electrospray 
ionization (ESI) mass spectrometry (conventional ion-trap 
or quadrupole-MS or TOF-MS). The analytical approach and 
process are identical to those described above. Other 
types of 3D data amenable to this approach include but are 
not limited to: 

2D-LC with other multi-channel spectral detection by UV, 
fluorescence (with sequence-specific tags or tags whose 
fluorescence is affected by a segment of the protein 
sequence), etc. 

3D electrophoresis or 3D LC with a single channel 
detection (UV at 245 nm, for example). The 3D separation 
can be applied to intact proteins to separate, for 
example, in pI, MW, and hydrophobicity. 

1D electrophoresis followed by 1D-LC/MS on either digested 
or intact proteins. 

2D gel separation followed by MS multi-channel detection. 
If digestion is needed, it can be accomplished on the gel 
with the proper MALDI matrix for on-the-gel TOF analysis. 

Other 2D means of separation coupled with multi-channel 
detection. 

1D separation coupled with 2D spectral detection, for 
example, LC/MS/MS. 

1D LC or 1D gel electrophoresis coupled with 2D spectral 
detection, for example, excitation-emission 2D 
fluorescence (EEM). 

The methods of analysis of the present invention can be 
realized in hardware, software, or a combination of 
hardware and software. Any kind of computer system, or 
other apparatus adapted for carrying out the methods 
and/or functions described herein, is suitable. A 
typical combination of hardware and software could be a 
general purpose computer system with a computer program 
that, when loaded and executed, controls the computer 
system, which in turn controls an analysis system, such 
that the system carries out the methods described herein. 
The present invention can also be embedded in a computer 
program product, which comprises all the features 
enabling the implementation of the methods described 
herein, and which, when loaded in a computer system 
(which in turn controls an analysis system), is able to 
carry out these methods. 

Computer program means or computer program in the present 
context include any expression, in any language, code or 
notation, of a set of instructions intended to cause a 
system having an information processing capability to 
perform a particular function either directly or after 
conversion to another language, code or notation, and/or 
reproduction in a different material form. 

Thus the invention includes an article of manufacture 
which comprises a computer usable medium having computer 
readable program code means embodied therein for causing 
a function described above. The computer readable 
program code means in the article of manufacture 
comprises computer readable program code means for 
causing a computer to effect the steps of a method of 
this invention. Similarly, the present invention may be 
implemented as a computer program product comprising a 
computer usable medium having computer readable program 
code means embodied therein for causing a function 
described above. The computer readable program code means 
in the computer program product comprises computer 
readable program code means for causing a computer to 
effect one or more functions of this invention. 
Furthermore, the present invention may be implemented as 
a program storage device readable by machine, tangibly 
embodying a program of instructions executable by the 
machine to perform method steps for causing one or more 
functions of this invention. 

It is noted that the foregoing has outlined some of the 
more pertinent objects and embodiments of the present 
invention. The concepts of this invention may be used 
for many applications. Thus, although the description is 
made for particular arrangements and methods, the intent 
and concept of the invention is suitable and applicable 
to other arrangements and applications. It will be clear 
to those skilled in the art that other modifications to 
the disclosed embodiments can be effected without 
departing from the spirit and scope of the invention. 
The described embodiments ought to be construed to be 
merely illustrative of some of the more prominent 
features and applications of the invention. Thus, it 
should be understood that the foregoing description is 
only illustrative of the invention. Various alternatives 
and modifications can be devised by those skilled in the 
art without departing from the invention. Other 
beneficial results can be realized by applying the 
disclosed invention in a different manner or modifying 
the invention in ways known to those familiar with the 
art. Thus, it should be understood that the embodiments 
have been provided as an example and not as a limitation. 
Accordingly, the present invention is intended to embrace 
all alternatives, modifications and variances which fall 
within the scope of the appended claims. 
CLAIMS 



What is claimed is: 

1. A method for analyzing data obtained from 
multiple samples in a separation system that has a 
capability for separating components of a sample 
containing more than one component as a function of two 
different variables, said method comprising: 

obtaining data representative of multiple samples 
from said system, said data being expressed as a function 
of said two variables; 

forming a data stack having successive levels, each 
level containing one of said data samples; 

forming a data array representative of a compilation 
of all of the data in said data stack; and 

separating said data array into a series of 
matrixes, said matrixes being: 

a concentration matrix representative of 
concentration of each component in said sample; 

a first profile of the components as a function 
of said first variable; and 

a second profile of the components as a 
function of said second variable. 

2. The method of claim 1, wherein said first profile 
and said second profile are representative of profiles of 
substantially pure components. 

3. The method of claim 1, further comprising 
performing qualitative analysis using at least one of 
said first profile and said second profile. 

4. The method of claim 1, further comprising 
standardizing data representative of a sample by 
performing a data matrix multiplication of such data into 
the product of a first standardization matrix, the data 
itself, and a second standardization matrix, to form a 
standardized data matrix. 

5. The method of claim 4, wherein terms in said 
first standardization matrix and said second 
standardization matrix have values that cause said data 
to be represented at positions with respect to said two 
variables, which are different in said standardized data 
matrix from those in said data array. 

6. The method of claim 5, wherein said first 
standardization matrix shifts said data with respect to 
said first variable, and said second standardization 
matrix shifts said data with respect to said second 
variable. 

7. The method of claim 5, wherein terms in said 
first standardization matrix and said second 
standardization matrix have values that serve to 
standardize distribution shapes of the data with respect 
to said first and second variable, respectively. 

8. The method of claim 4, wherein terms in said 
first standardization matrix and said second 
standardization matrix are determined by: 

applying a sample having known components to said 
apparatus; and 

selecting terms for said first standardization 
matrix and said second standardization matrix which cause 
data produced by said known components to be positioned 
properly with respect to said first variable and said 
second variable. 

9. The method of claim 8, wherein said terms are 
determined by selecting terms which produce a smallest 
error in position of said data with respect to said first 
variable and said second variable in said standardized 
data matrix. 

10. The method of claim 9, wherein the terms of said 
first standardization matrix and said second 
standardization matrix are computed for each sample. 

11. The method of claim 10, wherein terms of said 
first standardization matrix and said second 
standardization matrix are computed so as to produce a 
smallest error over all samples. 

12. The method of claim 4, wherein at least one of 
the first and second standardization matrices can be 
simplified to be either a diagonal matrix or an identity 
matrix. 

13. The method of claim 4, wherein the terms in said 
first standardization matrix and said second 
standardization matrix are based on parameterized known 
functional dependence of said terms on said variables. 

14. The method of claim 8, wherein values of terms 
in said first standardization matrix and said second 
standardization matrix are determined by solving said 
data array R: 

$$R_{ijs} = \sum_{a=1}^{k} I_{aaa}\, Q_{ia}\, W_{ja}\, C_{sa} + E_{ijs}$$

where Q (m x k) contains pure profiles of all k components 
with respect to the first variable, W (n x k) contains 
pure profiles with respect to the second variable for the 
components, C (p x k) contains concentrations of these 
components in all p samples, I is a new data array with 
scalars on its super-diagonal as the only nonzero 
elements, and E (m x n x p) is a residual data array. 

15. The method of claim 1, wherein said apparatus is 
a two-dimensional electrophoresis separation system. 

16. The method of claim 15, wherein said first 
variable is isoelectric point and said second variable is 
molecular weight. 

17. The method of claim 1, wherein said variables 
are a result of any combination, in no particular 
sequence, and including self-combination, of 
chromatographic separation, capillary electrophoresis 
separation, gel-based separation, affinity separation and 
antibody separation. 

18. The method of claim 1, wherein one of the two 
variables is mass associated with the mass axis of a mass 
spectrometer. 

19. The method of claim 18, wherein said apparatus 
further comprises a chromatography system for providing 
said samples to said mass spectrometer, retention time 
being another of the two variables. 

20. The method of claim 18, wherein said apparatus 
further comprises an electrophoresis separation system 
for providing said samples to said mass spectrometer, 
migration characteristics of said sample being another of 
the two variables. 

21. The method of claim 18, wherein said data is 
continuum mass spectral data. 

22. The method of claim 18, wherein said data is 
used without centroiding. 

23. The method of claim 18, further comprising 
correcting said data for time skew. 

24. The method of claim 18, further comprising 
performing a calibration of said data with respect to 
mass and mass spectral peak shapes. 

25. The method of claim 18, wherein the other one of 
said first variable and said second variable is that of a 
region on a protein chip having a plurality of protein 
affinity regions. 

26. The method of claim 1, further comprising: 

obtaining data for said data array by using a single 
channel analyzer and by analyzing the samples 
successively. 

27. The method of claim 26, wherein said single 
channel detector is based on one of light absorption, 
light emission, light reflection, light transmission, 
light scattering, refractive index, electrochemistry, 
conductivity, radioactivity, or any combination thereof. 

28. The method of claim 27, wherein the components 
in said sample are bound to at least one of fluorescence 
tags, isotope tags, stains, affinity tags, or antibody 
tags. 



29. A computer readable medium having thereon 
computer readable code for use with a chemical analysis 
system having a data analysis portion for analyzing data 
obtained from multiple samples, said chemical analysis 
system having a separation portion that has a capability 
for separating components of a sample containing more 
than one component as a function of two different 
variables, said computer readable code being for causing 
the computer to perform a method comprising: 

obtaining data representative of multiple samples 
from said system, said data being expressed as a function 
of said two variables; 

forming a data stack having successive levels, each 
level containing one of said data samples; 

forming a data array representative of a compilation 
of all of the data in said data stack; and 

separating said data array into a series of 
matrixes, said matrixes being: 

a concentration matrix representative of 
concentration of each component in said sample; 

a first profile of the components as a function 
of said first variable; and 

a second profile of the components as a 
function of said second variable. 

30. The computer readable medium of claim 29, 
further comprising computer readable code for causing 
said computer to analyze data by performing the steps of 
any one of claims 2-28. 

31. A chemical analysis system for analyzing data 
obtained from multiple samples, said system having a 
separation system that has a capability for separating 
components of a sample containing more than one component 
as a function of two different variables, said system 
having apparatus for performing a method comprising: 

obtaining data representative of multiple samples 
from said system, said data being expressed as a function 
of said two variables; 

forming a data stack having successive levels, each 
level containing one of said data samples; 

forming a data array representative of a compilation 
of all of the data in said data stack; and 

separating said data array into a series of 
matrixes, said matrixes being: 

a concentration matrix representative of 
concentration of each component in said sample; 

a first profile of the components as a function 
of said first variable; and 

a second profile of the components as a 
function of said second variable. 

32. The chemical analysis system of claim 31, 
wherein said method further comprises the steps of any 
one of claims 2-28. 



33. A method for analyzing data obtained from a 
sample in a separation system that has a capability for 
separating components of a sample containing more than 
one component, said method comprising: 

separating said sample with respect to at least a 
first variable to form a separated sample; 

separating said separated sample with respect to at 
least a second variable to form a further separated 
sample; 

obtaining data representative of said further 
separated sample from a multi-channel analyzer, said data 
being expressed as a function of three variables; 

forming a data stack having successive levels, each 
level containing data from one channel of said multi-
channel analyzer; 

forming a data array representative of a compilation 
of all of the data in said data stack; and 

separating said data array into a series of matrixes 
or arrays, said matrixes or arrays being: 

a concentration data array representative of 
concentration of each component in said sample on 
its super-diagonal; 

a first profile of each component as a function of a 
first variable; 

a second profile of each component as a function of 
a second variable; and 

a third profile of each component as a function of a 
third variable. 

34. The method of claim 33, wherein said first 
profile, said second profile, and said third profile are 
representative of profiles of substantially pure 
components. 

35. The method of claim 33, further comprising 
performing qualitative analysis using at least one of 
said first profile, said second profile, and said third 
profile. 

36. The method of claim 33, further comprising 
standardizing data representative of a sample by 
performing a data matrix multiplication of such data into 
the product of a first standardization matrix, the data 
itself, and a second standardization matrix, to form a 
standardized data matrix. 

37. The method of claim 36, wherein terms in said 
first standardization matrix and said second 
standardization matrix have values that cause said data 
to be represented at positions with respect to two of 
said three variables, which are different in said 
standardized data matrix from those in said data array. 

38. The method of claim 37, wherein said first 
standardization matrix shifts said data with respect to 
one of said two variables, and said second 
standardization matrix shifts said data with respect to 
the other of said two variables. 

39. The method of claim 37, wherein terms in said 
first standardization matrix and said second 
standardization matrix have values that serve to 
standardize distribution shapes of the data with respect 
to said two variables, respectively. 

40. The method of claim 36, wherein terms in said 
first standardization matrix and said second 
standardization matrix are determined by: 

applying a sample having known components to said 
apparatus; and 

selecting terms for said first standardization 
matrix and said second standardization matrix which cause 
data produced by said known components to be positioned 
properly with respect to the two variables. 

41. The method of claim 40, wherein said terms are 
determined by selecting terms which produce a smallest 
error in position of said data with respect to the two 
variables, in said standardized data matrix. 

42. The method of claim 41, wherein the terms of 
said first standardization matrix and said second 
standardization matrix are computed for a single channel. 

43. The method of claim 42, wherein terms of said 
first standardization matrix and said second 
standardization matrix are computed so as to produce a 
smallest error for the channel. 

44. The method of claim 36, wherein at least one of 
the first and second standardization matrices can be 
simplified to be either a diagonal matrix or an identity 
matrix. 

45. The method of claim 36, wherein the terms in 
said first standardization matrix and said second 
standardization matrix are based on parameterized known 
functional dependence of said terms on said variables. 

46. The method of claim 41, wherein values of terms 
in said first standardization matrix and in said second 
standardization matrix are determined by solving data 
array R: 

$$R_{ijs} = \sum_{a=1}^{k} I_{aaa}\, Q_{ia}\, W_{ja}\, C_{sa} + E_{ijs}$$

where Q (m x k) contains pure profiles of all k components 
with respect to the first variable, W (n x k) contains 
pure profiles with respect to the second variable for the 
components, C (p x k) contains pure profiles of these 
components with respect to the multichannel analyzer or 
the third variable, I (k x k x k) is a new data array 
with scalars on its super-diagonal as the only nonzero 
elements representing the concentrations of all said k 
components, and E (m x n x p) is a residual data array. 

47. The method of claim 33, wherein one of said 
separation apparatus is a one-dimensional electrophoresis 
separation system. 

48. The method of claim 47, wherein said variable is 
one of isoelectric point and molecular weight. 

49. The method of claim 33, wherein said two 
separation variables are a result of any combination, in 
no particular sequence, and including self-combination, 
of chromatographic separation, capillary electrophoresis 
separation, gel-based separation, affinity separation and 
antibody separation. 

50. The method of claim 33, wherein one of the three 
variables is mass associated with the mass axis of a mass 
spectrometer. 

51. The method of claim 50, wherein said apparatus 
further comprises at least one chromatography system for 
providing said separated samples to said mass 
spectrometer, retention time being at least one of the 
variables. 

52. The method of claim 50, wherein said apparatus 
further comprises at least one electrophoresis separation 
system for providing said separated samples to said mass 
spectrometer, migration characteristics of said sample 
being at least one of the variables. 

53. The method of claim 50, wherein said data is 
continuum mass spectral data. 

54. The method of claim 50, wherein said data is 
used without centroiding. 

55. The method of claim 50, further comprising 
correcting said data for time skew. 

56. The method of claim 50, further comprising 
performing a calibration of said data with respect to 
mass and spectral peak shapes. 

57. The method of claim 50, wherein said apparatus 
comprises a protein chip having a plurality of protein 
affinity regions, location of a region being one of said 
three variables. 

58. The method of claim 33, wherein said 
multichannel analyzer is based on one of light 
absorption, light emission, light reflection, light 
transmission, light scattering, refractive index, 
electrochemistry, conductivity, radioactivity, or any 
combination thereof. 

59. The method of claim 58, wherein the components 
in said sample are bound to at least one of fluorescence 
tags, isotope tags, stains, affinity tags, or antibody 
tags. 

60. The method of claim 33, wherein said apparatus 
comprises a two-dimensional electrophoresis separation 
system. 

61. The method of claim 60, wherein a first of said 
at least one variable is isoelectric point and a second 
of said at least one variable is molecular weight. 



62. A computer readable medium having thereon 
computer readable code for use with a chemical analysis 
system having a data analysis portion for analyzing data 
obtained from a sample, said chemical analysis system 
having a separation portion that has a capability for 
separating components of a sample containing more than 
one component as a function of at least one variable, 
said computer readable code being for causing the 
computer to perform a method comprising: 

separating said sample with respect to at least a 
first variable to form a separated sample; 

separating said separated sample with respect to at 
least a second variable to form a further separated 
sample; 

obtaining data representative of said further 
separated sample from a multi-channel analyzer, said data 
being expressed as a function of three variables; 

forming a data stack having successive levels, each 
level containing data from one channel of said multi-
channel analyzer; 

forming a data array representative of a compilation 
of all of the data in said data stack; and 

separating said data array into a series of matrixes 
or arrays, said matrixes or arrays being: 

a concentration data array representative of 
concentration of each component in said sample on 
its super-diagonal; 

a first profile of each component as a function of a 
first variable; 

a second profile of each component as a function of 
a second variable; and 

a third profile of each component as a function of a 
third variable. 
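
The structure recited in claim 62 can be illustrated with a small numerical sketch. It assumes one common reading of the claim language, namely that the data array equals a core array carrying the concentrations on its super-diagonal, multiplied along each of its three directions by the corresponding profile. The claim itself does not spell out this algebra, and the function names, variable names, and numbers below are hypothetical.

```python
def superdiagonal_core(conc):
    """Return an n x n x n nested list with conc on its super-diagonal."""
    n = len(conc)
    return [[[conc[p] if p == q == r else 0.0 for r in range(n)]
             for q in range(n)] for p in range(n)]


def build_stack(core, first, second, third):
    """Form a data stack with one level (a 2-D list) per analyzer channel.

    first[i][p], second[j][q] and third[k][r] hold the profile of a
    component (column index) at each point of the first, second and
    third variable (row index). This layout is an assumption.
    """
    n = len(core)
    stack = []
    for k in range(len(third)):  # one level per channel
        level = [[sum(core[p][q][r] * first[i][p] * second[j][q] * third[k][r]
                      for p in range(n) for q in range(n) for r in range(n))
                  for j in range(len(second))]
                 for i in range(len(first))]
        stack.append(level)
    return stack


# Illustrative values only: two components, 3 x 2 points, 3 channels.
conc = [2.0, 3.0]
first = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
second = [[1.0, 0.2], [0.3, 1.0]]
third = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

stack = build_stack(superdiagonal_core(conc), first, second, third)
```

Because the core is zero away from its super-diagonal, each entry of the stack reduces to a sum over components of concentration times the three profile values, which is why each component contributes independently to the data. The separating step of the claim runs in the opposite direction, recovering the four arrays from the stack.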

63. The computer readable medium of claim 62,
further comprising computer readable code for causing 
said computer to analyze data by performing the steps of 
any one of claims 34-61. 

64. A chemical analysis system for analyzing data
obtained from a sample, said system having a separation 
system that has a capability for separating components of 
a sample containing more than one component as a function 
of at least one variable, said system having apparatus 
for performing a method comprising: 

separating said sample with respect to at least a 
first variable to form a separated sample; 

separating said separated sample with respect to at 
least a second variable to form a further separated 
sample; 

obtaining data representative of said further 
separated sample from a multi-channel analyzer, said data 
being expressed as a function of three variables; 

forming a data stack having successive levels, each
level containing data from one channel of said
multi-channel analyzer;

forming a data array representative of a compilation 
of all of the data in said data stack; and 

separating said data array into a series of matrixes 
or arrays, said matrixes or arrays being: 

a concentration data array representative of
concentration of each component in said sample on
its super-diagonal;

a first profile of each component as a function of a 
first variable; 

a second profile of each component as a function of 
a second variable; and

a third profile of each component as a function of a
third variable. 

65. The chemical analysis system of claim 64, 
wherein said method further comprises the steps of any one
of claims 34-61.

66. A method for analyzing data obtained from at
least one sample in a separation system that has a 
capability for separating components of a sample 
containing more than one component as a function of at 
least two different variables, said method comprising:

obtaining data representative of said at least one 
sample from said system, said data being expressed as a 
function of said two variables;

forming a data stack having successive levels, each 
level containing successive data representative of said 
at least one sample;

forming a data array representative of a compilation 
of all of the data in said data stack; and 

separating said data array into a series of 
matrixes, said matrixes being: 

a concentration matrix representative of 
concentration of each component in said sample; 

a first profile of the components as a function 
of a first of said variables; and 

a second profile of the components as a 
function of a second of said variables. 
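
As a rough illustration of the separating step of claim 66, the sketch below fits a data stack with a single component by alternating least squares. Alternating least squares is a stand-in technique chosen for brevity and is not asserted to be the procedure of the claimed method. The single-component restriction, the unit-norm scaling of the profiles, and all names are assumptions made for the example.

```python
import math


def _unit(v):
    """Scale a vector to unit Euclidean norm (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rank_one_trilinear(stack, sweeps=20):
    """Fit stack[s][i][j] ~ conc[s] * first[i] * second[j].

    Each level s of the stack is a 2-D data set. Only one component is
    modelled, which is a simplifying assumption of this sketch.
    """
    n_s, n_i, n_j = len(stack), len(stack[0]), len(stack[0][0])
    conc = [1.0] * n_s
    second = [1.0] * n_j
    for _ in range(sweeps):
        # Update the first profile with conc and second held fixed.
        first = _unit([sum(stack[s][i][j] * conc[s] * second[j]
                           for s in range(n_s) for j in range(n_j))
                       for i in range(n_i)])
        # Update the second profile with conc and first held fixed.
        second = _unit([sum(stack[s][i][j] * conc[s] * first[i]
                            for s in range(n_s) for i in range(n_i))
                        for j in range(n_j)])
        # With unit-norm profiles, the least-squares value for each
        # level is the projection of that level onto the profile pair.
        conc = [sum(stack[s][i][j] * first[i] * second[j]
                    for i in range(n_i) for j in range(n_j))
                for s in range(n_s)]
    return conc, first, second


# Synthetic single-component data: two levels, 3 x 2 points per level.
true_conc = [1.0, 2.0]
true_first = [1.0, 2.0, 2.0]
true_second = [3.0, 4.0]
stack = [[[c * a * b for b in true_second] for a in true_first]
         for c in true_conc]

conc, first, second = rank_one_trilinear(stack)
```

The recovered profiles match the synthetic ones up to scale, and the ratio of the recovered concentrations between levels matches the ratio used to build the data. Absolute concentrations depend on how the profiles are normalized, so a calibration convention would be needed in practice.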

67. The method of claim 66, wherein said at least one
sample comprises a single sample, and said successive data
is representative of said sample as a function of time.

68. The method of claim 66, wherein said at least one
sample comprises a single sample, and said successive data 
is representative of said sample as a function of mass of 
its components. 

69. The method of claim 68, wherein said at least one 
sample comprises a plurality of samples, and said 
successive data is representative of successive samples.